The 2nd Workshop on Hardware/Software Support for High Performance Scientific and
Engineering Computing
SHPSEC-03
New Orleans, Louisiana, Sep. 27--Oct. 1, 2003
in conjunction with
The 12th International Conference on Parallel Architectures and Compilation Techniques (PACT-03)
Message from the SHPSEC-03 Workshop Chairs
Welcome to the 2nd International Workshop on Hardware/Software Support
for High Performance Scientific and Engineering Computing (SHPSEC-03),
held in conjunction with the 12th International Conference on Parallel
Architectures and Compilation Techniques (PACT-03) in New Orleans,
Louisiana, USA, Sep. 27-Oct. 1, 2003.
The field of parallel and distributed processing has gained
prominence through advances in electronic and integrated-circuit
technologies beginning in the 1940s. These are exciting times, and the
years to come will witness a proliferation in the use of parallel and
distributed systems, including supercomputers. The scientific and
engineering application domains have a key role in shaping future
research and development activities in academia and industry.
The purpose of this workshop is to provide an open forum for computer
scientists, computational scientists and engineers, applied
mathematicians, and researchers to present, discuss, and exchange
research-related ideas, results, works-in-progress, and experiences
in the areas of architectural, compilation, and language support for
problems in science and engineering applications.
Possible topics include, but are not limited to:
• Parallel and distributed architectures and computation models
• Hardware/software co-design
• Compilers, hardware, tools, programming languages and algorithms,
operating system support and I/O issues
• Techniques in the context of JAVA (Java multi-threading, Java
processors with parallelism, novel compiler optimizations for
Java, etc.)
• Compiler/hardware support for efficient memory systems
• Novel uses for threads to improve performance or power
• Software dynamic translation and modification
• Hardware/software and technological support for energy-aware
applications
• Network, Mobile/wireless processing and computing
• Software processes for development/maintenance/evolution of
scientific and engineering software
• Cluster and grid computing
The application areas are to be understood very broadly and
include, but are not limited to, computational fluid dynamics
and mechanics, material sciences, space, weather, climate
systems and global changes, computational environment and energy
systems, computational ocean and earth sciences, combustion
system simulation, computational chemistry, computational
physics, bioinformatics and computational biology, medical
applications, transportation systems simulations,
combinatorial and global optimization problems, structural
engineering, computational electro-magnetics, data mining,
computer graphics, virtual reality and multimedia,
computational finance, semiconductor technology, electronic
circuit and system design, signal and image processing, and
dynamic systems.
We received a very large number of paper submissions, not only from
the Asia-Pacific region but also from Europe and North America. All
submissions were reviewed by at least three committee members or
external reviewers. It was extremely difficult to select the
presentations for the workshop because there were many excellent and
interesting submissions. In order to accommodate as many papers as
possible while keeping the quality of the workshop high, we finally
decided to accept 25 papers for oral presentation at the workshop.
We believe all of these papers and topics will not only provide novel
ideas, new results, work in progress, and state-of-the-art techniques
in this field, but also stimulate future research activities in
the area of hardware/software support for high performance computing
in science and engineering applications.
The program for this workshop is the result of the hard and excellent
work of many people, including the reviewers and program committee
members. We would like to express our sincere appreciation to all
authors for their valuable contributions, and to all program committee
members and external reviewers for their cooperation in completing the
workshop program under a very tight schedule. Last but not least, we
thank Dr. Martin Schulz and Prof. J. Ramanujam, the workshop chairs of
PACT-03, for their patience and effort in helping and encouraging the
inclusion of SHPSEC-03 in PACT-03.
SHPSEC-03 Program Chairs
Laurence Tianruo Yang, St. Francis Xavier University, Canada
Minyi Guo, University of Aizu, Japan
SHPSEC-03 Program Vice-Chairs
Martin Buecker, Aachen University of Technology, Germany
Jiannong Cao, The Hong Kong Polytechnic University, P. R. China
Weng-Long Chang, Southern Taiwan University of Technology, Taiwan
Vipin Chaudhary, Wayne State University, USA
Beniamino Di Martino, Second University of Naples, Italy
Weijia Jia, City University of Hong Kong, P. R. China
Darren J. Kerbyson, Los Alamos National Lab., USA
Jie Li, University of Tsukuba, Japan
Thomas Rauber, University of Bayreuth, Germany
Gudula Runger, Technical University of Chemnitz, Germany
Technical and Program Committee:
Hamid R. Arabnia, University of Georgia, USA
Ioana Banicescu, Mississippi State University, USA
Subhash Bhalla, University of Aizu, Japan
Anu Bourgeois, Georgia State University, USA
Martin Buecker, Aachen University of Technology, Germany
J. Carlos Cabaleiro, Universidade Santiago de Compostela, Spain
Kirk W. Cameron, University of South Carolina, USA
Jiannong Cao, The Hong Kong Polytechnic University, P. R. China
Kasidit Chanchio, Oak Ridge National Laboratory, USA
Weng-Long Chang, Southern Taiwan University of Technology, Taiwan
Barbara Chapman, University of Houston, USA
Vipin Chaudhary, Wayne State University, USA
Beniamino Di Martino, Second University of Naples, Italy
Ramon Doallo, Universidade da Coruna, Spain
Rudolf Eigenmann, Purdue University, USA
Thomas Fahringer, University of Vienna, Austria
Len Freeman, University of Manchester, UK
Michael Gerndt, Technical University of Munich, Germany
John L. Gustafson, Sun Microsystems, Inc., USA
Dora B. Heras, Universidade Santiago de Compostela, Spain
Pao-Ann Hsiung, National Chung Cheng University, Taiwan, R.O.C.
Constantinos Ierotheou, University of Greenwich, UK
Weijia Jia, City University of Hong Kong, P. R. China
Hai Jin, Huazhong University of Science and Technology, P. R. China
D. J. Kerbyson, Los Alamos National Laboratory, USA
Zhiling Lan, Illinois Institute of Technology, USA
Jie Li, University of Tsukuba, Japan
Sanli Li, Tsinghua University, P. R. China
Zhiyuan Li, Purdue University, USA
Xiaola Lin, University of Hong Kong, P. R. China
Yong Luo, Intel, USA
Russ Miller, State University of New York at Buffalo, USA
Frank Mueller, North Carolina State University, USA
Michael K. Ng, University of Hong Kong, P. R. China
John O'Donnell, University of Glasgow, UK
Stephan Olariu, Old Dominion University, USA
Mohamed Ould-Khaoua, University of Glasgow, UK
Tomas F. Pena, Universidade Santiago de Compostela, Spain
Janusz Pipin, National Research Council of Canada, Canada
Constantine Polychronopoulos, University of Illinois at Urbana-Champaign, USA
Xiangzhen Qiao, Chinese Academy of Sciences, P. R. China
J. Ramanujam, Louisiana State University, USA
Yi Pan, Georgia State University, USA
Thomas Rauber, University of Bayreuth, Germany
Jose D. P. Rolim, University of Geneva, Switzerland
Gudula Runger, Technical University of Chemnitz, Germany
Olaf Schenk, University of Basel, Switzerland
Erich Schikuta, University of Vienna, Austria
Stanislav G. Sedukhin, University of Aizu, Japan
Edwin H.-M. Sha, University of Texas at Dallas, USA
Gurdip Singh, Kansas State University, USA
Peter Strazdins, Australian National University, Australia
Eric de Sturler, University of Illinois at Urbana-Champaign, USA
Sabin Tabirca, University College Cork, Ireland
Luciano Tarricone, University of Perugia, Italy
Xinmin Tian, Intel, USA
Mateo Valero, Universidad Politecnica de Catalunya, Spain
Chengzhong Xu, Wayne State University, USA
Jie Wu, Florida Atlantic University, USA
Zhiwei Xu, National Center for Intelligent Computing Systems, P. R. China
Jingling Xue, The University of New South Wales, Australia
Zahari Zlatev, National Environmental Research Institute, Denmark
Xiaodong Zhang, College of William and Mary, USA
Yu Zhuang, Texas Tech University, USA
Weimin Zheng, Tsinghua University, P. R. China
Bingbing Zhou, Deakin University, Australia
Wanlei Zhou, Deakin University, Australia
Hans Zima, Jet Propulsion Laboratory, California Institute of Technology, USA, and University of Vienna, Austria
Albert Y. Zomaya, University of Sydney, Australia
List of Reviewers: Rocco Aversa, Alessandra Budillon, Fan Chan, Yanqing Ji,
Hai Jiang, Hai Jin, Dora B. Heras, Sascha Hunold,
Guan-Joe Lai, Lidong Lin, Man Lin, John O'Donnell,
Zina Ben Miled, Gennadiy Nikishkov, Chang Woo Pyo,
Xiangzhen Qiao, Arno Rasch, Gianmarco Romano,
Robert R. Roxas, Gianpaolo Saggese, Erich Schikuta,
Stanislav G. Sedukhin, Emil Slusanschi, Oliviero Talamo,
Yuichi Tsujita, Tu Wanqing, Alexander P. Vazhenin,
Andre Vehreschild, Salvatore Venticinque, John Walters,
Bo-Wen Wang, Guojun Wang, Jiahong Wang, Andreas Wolf,
Xingfu Wu, Yusaku Yamamoto, Rentaro Yoshioka, Binbing Zhou,
Jingyang Zhou, Dakai Zhu
ADVANCE PROGRAM
SEPTEMBER 27, 2003
8:00-8:30 am Registration and Breakfast
8:30-10:10 am  Session 1: System Architectures (1)
  Utilization of the On-Chip L2 Cache Area in CC-NUMA Multiprocessors for Applications with a Small Working Set
    Sung Woo Chung, Hyong-Shik Kim and Chu Shik Jhon, Samsung Electronics, Chungnam National University and Seoul National University, Korea
  Traditional versus Next-Generation Journaling File Systems: a Performance Comparison Approach
    Juan Piernas, Toni Cortes and Jose M. Garcia, Universidad de Murcia and Universitat Politecnica de Catalunya, Spain
  Super-Fast Active Memory Operations-Based Synchronization
    Lixin Zhang, Zhen Fang, John B. Carter and Michael A. Parker, University of Utah, USA
  Dynamic Code Repositioning for Java
    Shinji Tanaka, Tetsuyasu Yamada, and Satoshi Shiraishi, Network Service Systems Laboratories, NTT, Japan
10:10-10:30 am  Coffee Break
10:30 am-12:10 pm  Session 2: Numerical Computations
  Designing Polylibraries to Speed Up Linear Algebra Computations
    Pedro Alberti, Pedro Alonso, Antonio Vidal, Javier Cuenca and Domingo Gimenez, Universidad Politecnica de Valencia and Universidad de Murcia, Spain
  A Parallel Implementation of Multi-domain High-order Navier-Stokes Equations Using MPI
    Hui Wang, Yi Pan, Minyi Guo, Anu G. Bourgeois, Da-Ming Wei, University of Aizu, Japan and Georgia State University, USA
  Developing an Object-oriented Parallel Iterative-Methods Library
    Chakib Ouarraoui and David Kaeli, Northeastern University, USA
  Super-Programming Techniques for Large Sparse Matrix Multiplication on PC Clusters
    Dejiang Jin and Sotirios G. Ziavras, New Jersey Institute of Technology, USA
12:10-1:30 pm  Lunch Break
1:30-3:10 pm  Session 3: Scientific and Engineering Applications
  A Scene Partitioning Method for Global Illumination Simulations on Clusters
    M. Amor, J. R. Sanjurjo, E. J. Padron, A. P. Guerra and R. Doallo, Universidade da Coruna, Spain
  Efficient Scheduling for Low-Power High-Performance DSP Applications
    Zili Shao, Qingfeng Zhuge, Youtao Zhang, Edwin H.-M. Sha, University of Texas at Dallas, USA
  Robust Transmission of Wavelet Video Sequence Using Intra Prediction Coding Scheme
    Joo-Kyong Lee, Ki-Dong Chung, Pusan National University, Korea
  A Novel Hierarchical Architecture for Communication Delay Enhancement in Large-Scale Information Systems
    Khaled Ragab, Naohiro Kaji, Khoirul Anwar, and Kinji Mori, Tokyo Institute of Technology, Japan
3:10-3:30 pm  Coffee Break
3:30-5:10 pm  Session 4: Language and Computation Model
  Problems with Using MPI 1.1 and 2.0 as Compilation Targets for Parallel Language Implementations
    Dan Bonachea and Jason Duell, University of California at Berkeley, USA
  Fast Parallel Solution For Set-Packing and Clique Problems by DNA-based Computing
    Michael (Shan-Hui) Ho, Weng-Long Chang, Minyi Guo, and Laurence Tianruo Yang, Southern Taiwan University of Technology, Taiwan, University of Aizu, Japan and St. Francis Xavier University, Canada
  HPM: A Hierarchical Model for Parallel Computations
    Xiangzhen Qiao, Shuqing Chen et al., Chinese Academy of Sciences, P. R. China
  A Visual Environment for Specifying Global Reduction Operations
    Robert R. Roxas and Nikolay N. Mirenkov, University of Aizu, Japan
SEPTEMBER 28, 2003
8:30-10:10 am  Session 5: System Architectures (2)
  VLaTTe: A Java Just-in-Time Compiler for VLIW with Fast Scheduling and Register Allocation
    Suhyun Kim, Soo-Mook Moon, Kemal Ebcioglu and Erik Altman, Seoul National University, Korea and IBM T. J. Watson Research Center, USA
  Speeding-up Multiprocessors Running DSS Workloads through Coherence Protocols
    Pierfrancesco Foglia, Roberto Giorgi, Cosimo Antonio Prete, Università di Pisa, Italy
  Toward Optimization of OpenMP Codes for Synchronization and Data Reuse
    Tien-hsiung Weng and Barbara Chapman, University of Houston, USA
  Hierarchical Interconnection Networks Based on (3, 3)-Graphs for Massively Parallel Processors
    Gene Eu Jan, Yuan-Shin Hwang, Chung-Chih Luo and Deron Liang, National Taiwan Ocean University, Taiwan
10:10-10:30 am  Coffee Break
10:30 am-12:35 pm  Session 6: Distributed Computing
  High System Availability Using Neighbor Replication on Grid
    M. Mat Deris, A. Noraziah, M. Y. Saman, A. Noraida and Y. W. Yuan, University College Terengganu, Malaysia
  MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-enabled MPI Processes
    Namyoon Woo, Heon Y. Yeom and Taesoon Park, Seoul National University and Sejong University, Korea
  Omega Line Problem in a Consistent Global Recovery Line
    MaengSoon Baik, SungJin Choi, JoonMin Gil, ChanYeol Park, HyunChang Yoo and ChongSun Hwang, Korea University and Korea Institute of Science and Technology Information, Korea
  A Lightweight P2P Extension to ICP
    Ping-Jer Yeh, Yu-Chen Chuang, and Shyan-Ming Yuan, National Chiao Tung University and Macronix International Co., Ltd, Taiwan
  Comparative Analysis of the Prototype and the Simulator of a New Load Balancing Algorithm for Heterogeneous Computing Environments
    Rodrigo F. De Mello, Luis C. Trevelin, Maria Stela V. De Paiva et al., University of São Paulo and Federal University of São Carlos, Brazil
Session 1: System Architectures (1)
Utilization of the On-Chip L2 Cache Area in CC-NUMA Multiprocessors for Applications with a Small Working Set

Sung Woo Chung, Hyong-Shik Kim† and Chu Shik Jhon††
[email protected]  [email protected]  [email protected]

Processor Architecture P/T, System LSI Division, Device Solution Network, Samsung Electronics Co., Ltd., Suwon-Si, 442-742, Korea
†Department of Computer Science, Chungnam National University, Taejon, 305-764, Korea
††School of Electrical Engineering and Computer Science, Seoul National University, Seoul, 151-742, Korea
Abstract
In CC-NUMA multiprocessor systems, it is important to reduce the remote memory access time. Based upon the fact that increasing the size of the LRU second-level (L2) cache beyond a certain value does not reduce the cache miss rate significantly, in this paper we propose two split L2 caches to utilize the surplus of the L2 cache. The split L2 caches are composed of a traditional LRU cache and another cache to reduce the remote memory access time. Both work together to reduce the total L2 cache miss time by keeping remote (or long-distance) blocks as well as recently used blocks. For another cache, we propose two alternatives: an L2-RVC (Level 2 - Remote Victim Cache) and an L2-DAVC (Level 2 - Distance-Aware Victim Cache). The proposed split L2 caches reduce total execution time by up to 27%. It is also found that the proposed split L2 caches outperform a traditional single LRU cache of double the size.

Keywords: CC-NUMA, Cache Replacement Policy, On-Chip Cache, Split L2 Cache, Remote Victim Cache, Distance-Aware Cache, Interconnection Network
1. Introduction
Most processors heavily rely on the cache to reduce the average memory access time. There have been many studies on reducing the average cache access time, which depends not only on the cache miss rate but also on the cache miss penalty. Increasing the cache block size and prefetching cache blocks were proposed to reduce the cache miss rate, while the non-blocking cache, the critical-word-first cache, and the cache hierarchy were proposed to alleviate the cache miss penalty. By inserting a second-level (L2) cache between the first-level (L1) cache and memory, which is called a cache hierarchy, we can capture many accesses that would otherwise go out to memory. Another advantage of a cache hierarchy is that it allows the small first-level (L1) cache to keep up with the clock frequency of the processor.
A current trend in the design of the L2 cache is an increase in associativity and capacity. As the associativity of the L2 cache increases to 4 or even 8 ways, the replacement of blocks in the cache becomes an important issue for overall performance. The LRU replacement algorithm has been known to work moderately well with caches. With the LRU replacement policy, however, an increase in cache size beyond a certain value does not reduce the cache miss rate much. As shown in [1], with the LRU cache replacement policy in 32-processor systems, increasing the cache size over 256 KB does not reduce the cache miss rate much in most SPLASH-2 applications. The authors observe knees around a cache size of 256 KB in most application programs. This means that the memory access patterns do not benefit from caches larger than 256 KB in most of the SPLASH-2 applications.
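This knee behavior is easy to reproduce with a toy model. The sketch below is our own illustration (not the authors' simulator): a set-associative LRU cache replayed over a 256 KB working set stops improving once its capacity covers that working set.

```python
from collections import OrderedDict

class LRUCache:
    """Toy set-associative cache with LRU replacement (illustrative only)."""
    def __init__(self, num_sets, ways, line_size=32):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // self.line_size
        s = self.sets[tag % self.num_sets]
        if tag in s:                     # hit: move block to the MRU position
            s.move_to_end(tag)
            self.hits += 1
        else:                            # miss: evict the LRU block if the set is full
            if len(s) >= self.ways:
                s.popitem(last=False)
            s[tag] = True
            self.misses += 1

def miss_rate(cache_kb, passes=4):
    """Replay a 256 KB working set over an 8-way cache of the given size."""
    c = LRUCache(num_sets=cache_kb * 1024 // (32 * 8), ways=8)
    for _ in range(passes):
        for a in range(0, 256 * 1024, 32):   # 256 KB working set, 32 B lines
            c.access(a)
    return c.misses / (c.hits + c.misses)

for kb in (128, 256, 512):
    print(kb, "KB -> miss rate", miss_rate(kb))
```

With a cyclic 256 KB working set, the 128 KB cache thrashes (miss rate 1.0), while 256 KB and 512 KB both settle at the cold-miss floor of 0.25 over four passes: past the knee, extra capacity buys nothing.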
The L1 cache size is generally limited so that it can keep up with the processor clock. On the other hand, the L2 cache is large enough to reduce the L1 cache miss penalty. How can we utilize the increasing capacity of the L2 cache if 256 KB is sufficient for the applications? In fact, the sufficient capacity could be 512 KB, 1 MB, or more in other applications. However, there would be limits to performance enhancement with an LRU cache (we use the term "LRU cache" to refer exclusively to the LRU cache at the second level, for clarity, even though the L1 also runs an LRU replacement policy) beyond a certain capacity. If increasing the L2 cache size does not increase the cache hit rate significantly in the processors of a multiprocessor system, how do we make use of the surplus of the L2 cache?
In a multiprocessor system, a higher cache hit rate does not necessarily mean better performance. Completing a local memory block miss does not depend on the transmission latency of the interconnection network, while completing a remote memory block miss does. Thus the average cache miss time can be reduced by a remote cache [7] or an RVC [5, 6] even if it does not reduce the cache miss rate.
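To see why, a back-of-the-envelope model helps. The latencies and rates below are illustrative round numbers of our own, not the paper's simulation parameters: a cache that converts remote misses into local misses can win even with a slightly lower hit rate.

```python
def avg_memory_time(hit_rate, remote_frac, hit=6, local_miss=50, remote_miss=200):
    """Average access time (in cycles) for a two-outcome miss model.
    All latencies here are illustrative, not the paper's parameters."""
    miss_time = remote_frac * remote_miss + (1 - remote_frac) * local_miss
    return hit_rate * hit + (1 - hit_rate) * miss_time

# A: higher hit rate, but most misses go to remote memory.
# B: slightly lower hit rate, but a victim cache catches most remote misses.
a = avg_memory_time(hit_rate=0.95, remote_frac=0.8)
b = avg_memory_time(hit_rate=0.94, remote_frac=0.2)
print(f"A: {a:.2f} cycles, B: {b:.2f} cycles")
```

Configuration A averages 14.2 cycles, B only 10.44: the 1% hit-rate loss is more than paid back by cheaper misses, which is exactly the tradeoff the split L2 cache exploits.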
There have been several studies on reducing the remote memory access time and the traffic on the interconnect by inserting an additional cache. The remote cache, which stores only remote data, was proposed in [7]. R. Iyer and L. N. Bhuyan introduced the Switch Cache [8]. The main idea was to implement small, fast caches in the crossbar switches of the interconnect to capture and store shared data as they flow from the memory module to the requesting processor. A. Moga and M. Dubois proposed an SRAM network cache outside the processor [5], which is organized as a victim cache for remote data. The major appeal of SRAM network caches is that they add less penalty to the latency of RVC hits. The works mentioned above targeted reducing the average cache miss time by reducing the remote memory access time. The RVC itself was already proposed in [5, 6]. Unlike the organization of the RVC in [5, 6], the L2-RVC in this study is located at the second level and is not shared by multiple processors.
There also have been studies on split caches. A split L1 data cache for uniprocessor systems was proposed by A. Gonzalez et al. [9]. Their split cache consists of a cache for spatial locality and a cache for temporal locality. The former is a traditional cache based on the LRU replacement policy. The latter behaves as a victim cache, which stores only the recently used block of the total cache line and also runs the LRU replacement policy. A split L1 data cache for multiprocessor systems was also evaluated [10]. Those split caches are based on the LRU replacement policy, resulting in a reduction of the cache miss rate.
In this paper, we propose a split L2 cache for multiprocessor systems, composed of an LRU cache and another cache. To reduce the remote memory access time, we suggest the L2-RVC (Level 2 - Remote Victim Cache) and the L2-DAVC (Level 2 - Distance-Aware Victim Cache) for another cache. Another cache keeps remote (or long-distance) victim blocks, even if they are not needed until far in the future. The L2-RVC or the L2-DAVC operates just like the victim cache proposed by Jouppi [7], since it keeps blocks that were discarded by the LRU cache. However, it differs from the original victim cache in that it is as large as its victim provider (the LRU cache) and that it is accessed simultaneously with its victim provider.
In the next section, we describe the structure of the proposed L2 cache in detail. Section 3 explains the simulation methodology with the application programs used in our simulations and examines the simulation results. Section 4 concludes the paper.
2. Split L2 cache structure
Figure 1 shows a block diagram of the memory hierarchy with the cache hierarchy emphasized. A split L2 cache is composed of an LRU cache and another cache, which behaves just like a victim cache of the LRU cache. The LRU cache, which runs the LRU replacement policy, still favors recently used blocks, while another cache keeps remote (or long-distance) blocks. Another cache is yet another set-associative cache accessed simultaneously with the LRU cache. In order to exploit temporal locality, the LRU cache is still needed at the second level of the cache hierarchy. Since a block evicted from the L1 cache is sent to the L2 cache of the same node, we only have to consider the cache miss rate when selecting a victim in the L1 cache. Only when a block is replaced in the L2 cache do we have to consider the cache miss time as well as the cache miss rate.
Figure 1. Block diagram of the memory hierarchy with the split L2 cache
In a traditional two-level cache hierarchy, all cache blocks of the L1 cache are included in the L2 cache. This inclusion property makes it easy to maintain the memory hierarchy; however, it wastes cache capacity by duplicating data. We do not maintain the inclusion property between the L1 cache and the L2 cache, as in [3, 12, 13, 14]. Instead, we keep a duplicate copy of the tags and the states of the L1 cache in the L2 cache controllers. Since the advantages and disadvantages of the inclusion property are already described in [3, 12], and the inclusion property was not maintained in many off-the-shelf processors such as the Intel Xeon [14] and the Alpha 21264 [13], we do not discuss the efficiency of the inclusion property any further in this paper.
We propose the L2-RVC and the L2-DAVC for another cache. The L2-RVC contains only remote blocks, and it uses the LRU replacement policy to find a victim among the remote blocks. Accordingly, a remote block in the L2-RVC can be evicted not by a local block, but only by another remote block. Thus, in the total L2 cache, the number of remote blocks remains substantial compared to that of local blocks. Hence, the proposed L2 cache is expected to reduce the total cache miss time by reducing the cache miss rate of the remote blocks.
The other choice for another cache is the L2-DAVC. The L2-DAVC does not use the LRU replacement policy. Instead, it runs a shortest-distance replacement policy, in which the block whose home node is closest is replaced first. If more than one block in a set has the same distance, the victim is chosen randomly among them. A problem of the L2-DAVC is that a block whose home node is farthest from the current node may reside in the L2-DAVC for quite a long time. Nevertheless, as the L2-DAVC is combined with the LRU cache, there is still a high probability that recently used blocks are kept. This means that recently used blocks are not sacrificed for the L2-DAVC.
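A minimal sketch of the shortest-distance replacement policy, assuming a ring interconnect like the simulated system's (all helper names here are ours, not the paper's):

```python
import random

def ring_distance(node_a, node_b, num_nodes):
    """Hop count between two nodes on a bidirectional ring."""
    d = abs(node_a - node_b) % num_nodes
    return min(d, num_nodes - d)

def pick_davc_victim(cache_set, current_node, num_nodes, rng=random):
    """Shortest-distance replacement: evict a block whose home node is
    closest to the current node; ties are broken randomly."""
    dist = lambda blk: ring_distance(blk["home"], current_node, num_nodes)
    shortest = min(dist(b) for b in cache_set)
    candidates = [b for b in cache_set if dist(b) == shortest]
    return rng.choice(candidates)

# Example: on node 0 of a 16-node ring, the block homed at node 1 is evicted
# first, while the block homed at node 8 (farthest away) is retained.
blocks = [{"tag": 0xA, "home": 8}, {"tag": 0xB, "home": 1}, {"tag": 0xC, "home": 5}]
victim = pick_davc_victim(blocks, current_node=0, num_nodes=16)
print("evict block homed at node", victim["home"])  # -> 1
```

The policy keeps distant blocks precisely because re-fetching them is expensive, which is also the source of the drawback noted above: a far-away block can linger indefinitely.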
Figure 2. Operations of the LRU/another cache
Upon a cache miss in the L1 cache, the cache controller accesses the LRU cache and another cache at the same time. As far as the cache miss is concerned, we have the four cases shown in Figure 2, excluding an L2 coherence miss. When a block is not found in the L1 cache but is in the LRU cache (case a), a victim of the L1 cache is moved to the LRU cache, which may produce another victim there, and the found block is used to fill the empty L1 cache block. In the case that another cache is the L2-RVC, the possible new victim, if it is a remote block, is placed into the L2-RVC. If it is a local block, the possible new victim is evicted to the local memory, because the L2-RVC stores only remote blocks. In the case that another cache is the L2-DAVC, the possible new victim is compared with the cache blocks in the appropriate set of the L2-DAVC, and the block of the shortest distance is replaced, which lets the L2-DAVC store local blocks as well. When the L1-missed block is found in another cache (case b), the victim of the L1 cache is placed into the LRU cache. The subsequent replacements are the same as in case (a). If the block is found in neither the LRU cache nor another cache (case c), the cache controller sends a memory access request to either the local memory or the interconnect. Because of the non-inclusion property, an L1 cache miss that cannot be resolved either in the LRU cache or in another cache is filled directly from the local memory or the interconnect, without allocating a block in the LRU cache or another cache. The subsequent replacements are the same as in case (a).
In the case of an L1 cache coherence miss (case d), an invalidation is issued. We omit the operation of an L2 cache coherence miss, which is a combination of an L1 cache coherence miss and an L2 cache hit.
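Cases (a) through (c) can be traced with a small model. The ToyCache class and all names below are our own simplification of the controller logic (with an L2-RVC as another cache), not the authors' simulator:

```python
class ToyCache:
    """Minimal FIFO-ordered cache used only to trace the case logic."""
    def __init__(self, capacity):
        self.capacity, self.blocks = capacity, []

    def __contains__(self, addr):
        return addr in self.blocks

    def remove(self, addr):
        self.blocks.remove(addr)

    def insert(self, addr):
        """Insert addr; return the evicted block if the cache was full, else None."""
        victim = self.blocks.pop(0) if len(self.blocks) >= self.capacity else None
        self.blocks.append(addr)
        return victim

def l1_miss(addr, l1, lru, rvc, is_remote):
    """Cases (a)-(c) of Figure 2 for an LRU + L2-RVC split cache (sketch)."""
    if addr in lru:        # case (a): found in the LRU half of the L2
        lru.remove(addr)
    elif addr in rvc:      # case (b): found in the remote-victim half
        rvc.remove(addr)
    # case (c): found in neither half; the block would be fetched from local
    # memory or the interconnect, with no allocation in either L2 half.
    l1_victim = l1.insert(addr)            # the found/fetched block fills the L1
    if l1_victim is not None:              # the L1 victim drops into the LRU half
        l2_victim = lru.insert(l1_victim)
        if l2_victim is not None and is_remote(l2_victim):
            rvc.insert(l2_victim)          # only remote L2 victims enter the L2-RVC

l1, lru, rvc = ToyCache(1), ToyCache(2), ToyCache(2)
remote = lambda a: a >= 100                # pretend addresses >= 100 are remote
for a in [100, 101, 1, 102]:               # a stream of L1 misses
    l1_miss(a, l1, lru, rvc, remote)
print("LRU half:", lru.blocks, "L2-RVC:", rvc.blocks)
```

Running the stream shows the key behavior: a remote block (address 100) pushed out of the LRU half is caught by the L2-RVC instead of being written back to its remote home, while local victims are simply dropped.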
3. Simulations
3.1 Simulation environments
Increasing the cache size over 256 KB does not increase the cache hit rate much in most of the SPLASH-2 applications, as shown in [1]. The authors observed knees around a cache size of 256 KB in most application programs. We admit that 256 KB may not be sufficient in other applications; 256 KB is just a sample value for the SPLASH-2 applications. In this section we show the impact of the split L2 cache when 256 KB is sufficient for the LRU portion of the L2 cache.
We use an execution-driven simulator to evaluate the performance of the split L2 cache. The simulator is modified from MINT [4]. MINT executes each instruction in one simulation cycle and sends the events generated by memory operations to the architecture simulator, which then performs the operations necessary to simulate the system architecture.
Table 1. System parameters

Parameter                                           Value
Number of processors                                16, 32
Cache line size                                     32 B
Size of L1 cache                                    64 KB
L1 cache access time                                1 processor clock
L2 cache access time                                6 processor clocks
Memory access time                                  50 processor clocks
Directory (SRAM) access time                        20 processor clocks
Transfer delay between adjacent nodes of the ring   64 processor clocks
The system parameters are shown in Table 1. The simulated systems are CC-NUMA multiprocessors with 16 or 32 nodes. Each node has a processor with its associated caches and a memory. When the L2 cache is composed only of the LRU cache, it is 8-way set associative. Recall that there is no inclusion between the LRU cache and another cache. Accordingly, to make a fair comparison, we assume that the LRU cache and another cache are both 4-way set associative in the split L2 cache. We assume the point-to-point link delay of NUMAlink, which is used in the SGI Origin 3000 [11]. Thus, the transfer delay between adjacent nodes on the ring is 64 processor clocks, obtained as 60 processor clocks (= 20 ns = 32 B / (1.6 GB/s)) + 4 processor clocks (store-and-forward delay), with 1.6 GB/s unidirectional point-to-point links, a 32 B cache line size, and a 3 GHz processor clock. We maintain a 2-bit state field per cache block, corresponding to the MSI protocol, which is modified from a typical MESI protocol. The simulated system interconnects nodes with dual-ring-style point-to-point links, using a full-map directory protocol. We took eight applications, BARNES, CHOLESKY, FFT, FMM, LU, OCEAN, RADIX and WATER-NSQUARED, from the SPLASH-2 benchmark suite [1]. The characteristics of the selected applications are summarized in Table 2.
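The 64-clock link delay follows directly from the stated parameters; the short script below (a sanity check of ours, not part of the simulator) reproduces the arithmetic:

```python
# Reproduce the paper's link-delay arithmetic for the ring interconnect.
line_size_bytes = 32
link_bandwidth = 1.6e9          # 1.6 GB/s unidirectional point-to-point link
clock_hz = 3e9                  # 3 GHz processor clock
store_and_forward_clocks = 4

transfer_s = line_size_bytes / link_bandwidth        # 20 ns to push one line
transfer_clocks = transfer_s * clock_hz              # 60 processor clocks
total = transfer_clocks + store_and_forward_clocks   # 64 processor clocks
print(f"{transfer_s * 1e9:.0f} ns -> {transfer_clocks:.0f} + "
      f"{store_and_forward_clocks} = {total:.0f} clocks")
```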
3.2 Simulation results
First, we show the simulation results when another cache is the L2-RVC. We use LRU + X to mean that the L2 cache is composed of an LRU cache and that another cache is X. We also compare the performance of the LRU + L2-RVC to that of the LRU + L2-DAVC.
To avoid possible confusion, we first introduce some definitions. The pure transmission time on the interconnection network, excluding the delay due to any shortage of resources, is called the "transmission time on the interconnection network". The waiting time for resources on the interconnection network is called the "delay on the interconnection network". A notation X / Y is used, where X and Y are the size of the LRU cache and that of the L2-RVC or the L2-DAVC, respectively.
Application       Type                                                    Problem Size
BARNES            Barnes-Hut hierarchical N-body method                   16K particles
CHOLESKY          Blocked sparse Cholesky factorization                   Tk15.O
FFT               FFT computation                                         65536 complex doubles
FMM               Hierarchical N-body method (adaptive fast multipole)    16K particles
LU                Blocked dense linear algebra                            512x512 matrix, block 16
OCEAN             Study of ocean movements                                258x258 ocean grid
RADIX             Radix sort                                              2M integer keys, radix 1K
WATER-NSQUARED    Water simulation using O(n^2) algorithm                 512 molecules

Table 2. The characteristics of the applications
We assume that the execution time is composed of CPU time, L2 cache access time, directory access time, memory access time, transmission time on the interconnection network, and delay on the interconnection network.
There are two important factors we must observe to evaluate the impact of the split L2 cache on the execution time. One is the L1 cache hit rate, shown in Table 3, and the other is the hit count at another cache. A low L1 cache hit rate makes the L2 cache handle many cache operations. Many hits at another cache mean that remote misses (with the L2-RVC) or long-distance misses (with the L2-DAVC) are effectively removed (or replaced by local misses). If we have many hits at another cache together with a low L1 cache hit rate, the impact should be significant.
Application       P=16     P=32
BARNES            0.993    0.993
CHOLESKY          0.965    0.958
FFT               0.971    0.973
FMM               0.993    0.993
LU                0.985    0.992
OCEAN             0.948    0.965
RADIX             0.987    0.986
WATER-NSQUARED    0.999    0.998

Table 3. L1 cache hit rate versus the number of processors in the system
3.2.1 LRU + L2-RVC
In the first simulations, we examine the impact of the size of another cache by varying the size of the L2-RVC. Figure 3 and Figure 4 show the L2 cache hit rate and the normalized execution time for the 16-processor system, respectively. (Figures 3-6 are shown at the end of this paper; we will resize the figures and place them at appropriate positions if accepted.) Note that the hits at the LRU cache can be assumed to be the same as the hits of a 256 KB traditional L2 cache.
BARNES, FMM, LU, and WATER-NSQUARED have very high L1 cache hit rates, as shown in Table 3, and relatively high L2 cache hit rates, as shown in Figure 3. There is little difference among the cache hit rates of the 256 KB, 512 KB, and 1 MB traditional L2 caches. Thus, adopting the split L2 cache in these applications brings performance enhancements all the time, since there is little gain in the L2 hit rate from increasing the size of the LRU cache. Even though the split L2 cache hit rate is low, it is capable of reducing the execution time much more than the traditional L2 cache, since remote block misses cost more than local block misses do. As shown in Figure 4, any combination of the split L2 caches outperforms the 512 KB traditional L2 cache, and even outperforms the 1 MB traditional cache, with the above applications. Except for FMM, it seems that 64 KB is sufficient for the L2-RVC.
RADIX also has a high L1 cache hit rate. As shown in Figure 3, however, the difference between the hit rate of the 256 KB LRU cache and that of the 512 KB LRU cache is substantial. While the L2 cache hit rates of the 256K/64K and 256K/128K configurations (a notation X/Y is used, where X and Y are the sizes of the LRU cache and the L2-RVC, respectively) are remarkably low, their execution times are not so different from that of the 512 KB traditional L2 cache. We believe that hits at the L2-RVC reduce the execution time more than hits at the LRU cache do.
It is riskier to use the L2-RVC with applications such as FFT, OCEAN, and CHOLESKY, in which there is a large difference among the cache hit rates of the 256 KB, 512 KB, and 1 MB traditional L2 caches. However, if the gap between the cache hit rates is compensated by the L2-RVC, the performance enhancement can be significant. Though there are few hits at the L2-RVC in CHOLESKY, the degradation of the execution time is negligible since local memory accesses are dominant.
In FFT, as the size of the L2-RVC increases, the hit rate of the LRU cache also increases linearly. The improvement in execution time is significant because the L2-RVC outperforms the LRU cache in L2 cache hit rate. In FFT, increasing the L2-RVC significantly affects the execution time because of both a low L1 cache hit rate and a low L2 cache hit rate.
The result of OCEAN is similar to that of FFT. Due to the large working set, a large L2-RVC is needed to compensate for the low cache hit rate of the LRU cache. This is also the reason why OCEAN is the only application in which our split L2 cache does not outperform the 1 MB traditional LRU cache. The results are still encouraging, nonetheless, in that the 256K/256K split L2 cache works better than the same-size 512 KB traditional L2 cache in both cache hit rate and execution time.
Figure 5 and Figure 6 depict similar results when the number of processors is increased to 32. The performance gains in Figure 6 are greater than those in Figure 4. In the case of OCEAN, the 256K/512K split L2 cache finally outperforms the 1 MB traditional L2 cache in execution time. As the number of processors increases, the maximum and the average cache miss penalties tend to increase. It is therefore possible to reduce the execution time even more with 32 processors, since the split L2 caches effectively remove misses that would otherwise incur larger penalties.
LRU cache size              512 KB            1 MB
Number of processors        16       32       16       32
BARNES                      320 KB   320 KB   320 KB   320 KB
CHOLESKY                    384 KB   512 KB   384 KB   512 KB
FFT                         320 KB   320 KB   320 KB   320 KB
FMM                         320 KB   320 KB   320 KB   320 KB
LU                          320 KB   320 KB   320 KB   320 KB
OCEAN                       320 KB   512 KB   1 MB     768 KB
RADIX                       320 KB   320 KB   512 KB   320 KB
WATER-NSQUARED              320 KB   320 KB   320 KB   320 KB
Average (required capacity
to perform no worse than
the LRU cache, %)           328 KB   368 KB   440 KB   400 KB
                            (64.1%)  (71.9%)  (43.0%)  (39.1%)

Table 4. Required capacity of the LRU + L2-RVC to outperform the traditional LRU cache
We investigate the efficiency of the proposed LRU + L2-RVC in Table 4, which presents the required capacity of the LRU + L2-RVC to outperform the traditional LRU cache. The LRU cache of the LRU + L2-RVC is fixed to 256 KB in Table 4. We change the size of the L2-RVC not continuously but discretely (64 KB, 128 KB, 256 KB, and 512 KB).
To surpass the 512 KB LRU cache, the required capacity of the LRU + L2-RVC is, on average, just 328 KB (64.1%) and 368 KB (71.9%) with 16 and 32 processors, respectively. When we compare the LRU + L2-RVC to the 1 MB LRU cache, its efficiency is even more prominent: on average, only 440 KB (43.0%) and 400 KB (39.1%) are required to outperform the 1 MB LRU cache, with 16 and 32 processors, respectively.
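The averages follow directly from the per-application capacities in Table 4 and can be checked with simple arithmetic (the lists below just transcribe Table 4, in its row order):

```python
# Per-application capacity (KB) of the LRU + L2-RVC needed to match
# the traditional LRU cache, transcribed from Table 4.
required_512_p16 = [320, 384, 320, 320, 320, 320, 320, 320]
required_512_p32 = [320, 512, 320, 320, 320, 512, 320, 320]
required_1m_p16  = [320, 384, 320, 320, 320, 1024, 512, 320]
required_1m_p32  = [320, 512, 320, 320, 320, 768, 320, 320]

def avg_kb(caps):
    """Average required capacity in KB."""
    return sum(caps) / len(caps)

def pct(caps, baseline_kb):
    """Average capacity as a percentage of the baseline LRU cache."""
    return 100.0 * avg_kb(caps) / baseline_kb

# e.g. pct(required_512_p16, 512) -> 64.1 %
#      pct(required_1m_p32, 1024) -> 39.1 %
```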
3.2.2 LRU + L2-DAVC

Table 5 shows the L2 cache hit rates of the 512 KB LRU cache, the 1 MB LRU cache, the LRU + L2-RVC (256 KB/256 KB), and the LRU + L2-DAVC (256 KB/256 KB). The total size of the split L2 cache is 512 KB; in the LRU + X configurations, the LRU cache and the X cache each take half of the total L2 cache size. Table 6 shows the normalized execution times of the same four configurations. All results in Table 6 are normalized to the 512 KB LRU cache.
L2 cache organization   LRU (512 KB)   LRU (1 MB)    LRU + L2-RVC   LRU + L2-DAVC
Number of processors    16     32      16     32     16     32      16     32
BARNES                  0.90   0.87    0.92   0.89   0.90   0.89    0.90   0.88
CHOLESKY                0.21   0.22    0.26   0.25   0.16   0.19    0.18   0.20
FFT                     0.18   0.27    0.19   0.29   0.22   0.28    0.23   0.32
FMM                     0.59   0.56    0.64   0.61   0.61   0.58    0.59   0.57
LU                      0.90   0.76    0.91   0.77   0.91   0.77    0.90   0.77
OCEAN                   0.74   0.79    0.88   0.86   0.72   0.81    0.69   0.80
RADIX                   0.55   0.56    0.58   0.59   0.51   0.58    0.52   0.59
WATER-NSQUARED          0.64   0.54    0.65   0.55   0.65   0.54    0.66   0.55

Table 5. L2 cache hit rate
In some applications, such as CHOLESKY, FFT, RADIX, and WATER-NSQUARED, the L2 cache hit rates of the LRU + L2-DAVC are higher than those of the LRU + L2-RVC with 16 processors. However, this does not lead to a significant reduction in execution time: the distances between the nodes are so short that the differences in execution time between the L2-RVC and the L2-DAVC are small. Likewise, when the L2 cache hit rate of the LRU + L2-DAVC is lower than that of the LRU + L2-RVC with 16 processors, the differences in execution time are small in all applications except OCEAN. In OCEAN, the difference between the L2 hit rates of the LRU + L2-DAVC and the LRU + L2-RVC is large, resulting in a large difference in execution time.
L2 cache organization   LRU (512 KB)   LRU (1 MB)    LRU + L2-RVC   LRU + L2-DAVC
Number of processors    16     32      16     32     16     32      16     32
BARNES                  1.00   1.00    0.99   0.98   0.96   0.94    0.97   0.94
CHOLESKY                1.00   1.00    0.99   1.00   0.95   0.97    0.98   1.03
FFT                     1.00   1.00    0.99   0.94   0.74   0.78    0.74   0.73
FMM                     1.00   1.00    0.99   0.97   0.92   0.87    0.93   0.88
LU                      1.00   1.00    0.98   0.99   0.89   0.84    0.91   0.84
OCEAN                   1.00   1.00    0.55   0.69   0.81   0.82    0.95   0.80
RADIX                   1.00   1.00    0.96   0.95   0.86   0.76    0.86   0.75
WATER-NSQUARED          1.00   1.00    1.00   1.00   0.98   0.94    0.98   0.94

Table 6. Normalized execution time
In OCEAN, the L2 cache hit rate of the LRU + L2-DAVC is lower than that of the LRU + L2-RVC in the 32-processor system. However, the execution time of the LRU + L2-DAVC is shorter than that of the LRU + L2-RVC, though the difference is not large. This means that the total hit count at the L2-DAVC is lower than that at the L2-RVC, but the total miss time at the L2-DAVC is also lower.
Strangely, though the L2 cache hit rate of the LRU + L2-DAVC is higher than that of the LRU + L2-RVC in CHOLESKY, the execution time of the LRU + L2-DAVC is also longer. Recall that the L2-DAVC can contain local blocks, while the L2-RVC only contains remote blocks. In CHOLESKY, accesses to local blocks are frequent. Thus, in CHOLESKY, the hits at the L2-DAVC do not reduce the remote memory access time but the local memory access time, which causes this unexpected result.
Increasing the number of processors increases the distances between the nodes. In the simulations, the L2-DAVC leads to only a small reduction in execution time. With more processors, however, the LRU + L2-DAVC is expected to outperform the LRU + L2-RVC, since the LRU + L2-DAVC retains the blocks whose distance between the home node and the current node is long.
4. Conclusions and future work

For CC-NUMA multiprocessor systems, we have proposed two alternative split L2 caches: the LRU + L2-RVC and the LRU + L2-DAVC. The proposed split L2 caches improve performance by keeping remote (or long-distance) blocks as well as recently used blocks.
The execution time with the split L2 cache decreases by up to 27% compared to the traditional non-split L2 cache of the same size. Moreover, we found that the 512 KB split L2 cache could outperform the double-size (1 MB) traditional L2 cache in most applications. It is encouraging that the LRU + L2-DAVC performs better than the LRU + L2-RVC as the number of processors increases; we expect the LRU + L2-DAVC to perform much better than the LRU + L2-RVC with larger numbers of processors (64 or 128).
In summary, we conclude that the surplus L2 cache area can be effectively utilized by adopting our split cache organizations for CC-NUMA multiprocessors. The reason we compare the proposed split caches with the traditional LRU cache is to make sure that the surplus L2 cache area can be used to enhance performance without any additional off-chip cache.
We acknowledge that L2 cache sizes such as 256 KB, 512 KB, or 1 MB are not definitive, since we restrict the simulations to the SPLASH-2 applications. We also admit that the proposed split caches may degrade system performance when the cache size is too small for the application. However, when increasing the LRU cache does not improve total system performance, the proposed split caches can be relied on as an alternative solution that provides multiprocessor systems with good scalability.
Acknowledgements

We would like to thank Craig B. Stunkel at the IBM T. J. Watson Research Center for his helpful comments.
References

[1] S. C. Woo, et al., "The SPLASH-2 Programs: Characterization and Methodological Considerations," International Symposium on Computer Architecture, pp. 24-36, June 1995.
[2] N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," International Symposium on Computer Architecture, 1990.
[3] L. A. Barroso, et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," International Symposium on Computer Architecture, 2000.
[4] J. E. Veenstra and R. J. Fowler, "MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors," International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp. 201-207, 1994.
[5] M. Moga and M. Dubois, "The Effectiveness of SRAM Network Caches in Clustered DSMs," International Symposium on High Performance Computer Architecture, 1998.
[6] H. Oi and N. Ranganathan, "Utilization of Cache Area in On-Chip Multiprocessor," International Symposium on High Performance Computing, 1999.
[7] D. Lenoski, et al., "The DASH Prototype: Implementation and Performance," International Symposium on Computer Architecture, pp. 92-103, 1992.
[8] R. Iyer and L. N. Bhuyan, "Switch Cache: A Framework for Improving the Remote Memory Access Latency of CC-NUMA Multiprocessors," International Symposium on High Performance Computer Architecture, 1999.
[9] A. Gonzalez, C. Aliaga, and M. Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality," ACM International Conference on Supercomputing, pp. 338-347, 1995.
[10] J. Sahuquillo and A. Pont, "Two Management Approaches for the Split Data Cache in Multiprocessor Systems," 8th Euromicro Workshop on Parallel and Distributed Processing, pp. 301-308, 2000.
[11] Silicon Graphics, "SGI Origin 3000 Series: Bricks," available at http://www.sgi.com/origin/3000/bricks.html
[12] N. P. Jouppi and S. J. E. Wilton, "Tradeoffs in Two-Level On-Chip Caching," ACM SIGARCH Computer Architecture News, vol. 22(2), pp. 34-45, 1994.
[13] Compaq Computer Corporation, "21264/EV68CB and 21264/EV68DC Hardware Reference Manual," 2001.
[14] Intel Corporation, "IA-32 Intel Architecture Software Developer's Manual," 2003.
Figure 3. L2 cache hit rate (16 processors)
Figure 3. L2 cache hit rate (16 processors) - continued
Figure 4. Normalized execution time (16 processors)
Figure 4. Normalized execution time (16 processors) -continued
Figure 5. L2 cache hit rate (32 processors)
Figure 5. L2 cache hit rate (32 processors) - continued
Figure 6. Normalized execution time (32 processors)
Figure 6. Normalized execution time (32 processors) - continued
Traditional versus Next-Generation Journaling File Systems: a Performance Comparison Approach

Juan Piernas – Universidad de Murcia
Toni Cortes – Universitat Politècnica de Catalunya
José M. García – Universidad de Murcia
Abstract
DualFS is a next-generation journaling file system which has the same consistency guarantees as traditional journaling file systems but better performance. This paper summarizes the main features of DualFS, introduces three new enhancements which significantly improve DualFS performance during normal operation, and presents experimental results which compare DualFS with other traditional file systems, namely Ext2, Ext3, XFS, JFS, and ReiserFS. The experiments carried out cover a wide range of cases, from microbenchmarks to macrobenchmarks, and from desktop to server workloads. As we will see, these experiments not only show that DualFS has the lowest I/O time in general, but also that it can remarkably boost the overall system performance for many workloads (by up to 98%).
1 Introduction
High-performance computing environments rely on several hardware and software elements to do their jobs. Although some environments require specialized elements, many of them use off-the-shelf elements which must perform well in order to accomplish computing jobs as quickly as possible.
A key element in these environments is the file system. File systems provide a friendly means to store final and partial results, from which it is even possible to resume a job that failed because of a system crash, program error, or any other failure. Some computing jobs require a fast file system, others a quickly-recoverable file system, and others require both features.
DualFS [16] is a new concept of journaling file system which is aimed at providing both good performance and fast consistency recovery. DualFS, like other file systems [5, 7, 8, 18, 24, 28], uses a meta-data log approach for fast consistency recovery, but from a quite different point of view. Our new file system separates data blocks and meta-data blocks completely, and manages them in very different ways. Data blocks are placed on the data device (a disk partition), whereas meta-data blocks are placed on the meta-data device (another partition on the same disk). In order to guarantee fast consistency recovery after a system crash, the meta-data device is treated as a log-structured file system [18], whereas the data device is divided into groups in order to organize data blocks (as other file systems do [4, 8, 13, 24, 27]).

(This work has been supported by the Spanish Ministry of Science and Technology and the European Union (FEDER funds) under the TIC2000-1151-C07-03 and TIC2001-0995-C02-01 CICYT grants.)
Although the separation between data and meta-data is not a new idea [1, 15], this paper introduces a new version of DualFS which proves, for the first time, that this separation can significantly improve file-system performance without requiring several storage devices, unlike previous proposals.
The new version of DualFS has several important improvements with respect to the previous one [16]. First of all, DualFS has been ported to the 2.4.19 Linux kernel. Secondly, we have greatly changed the implementation and removed some important bottlenecks. And finally, we have added three new enhancements which take advantage of the special structure of DualFS and significantly increase its performance. These enhancements are: meta-data prefetching, on-line meta-data relocation, and directory affinity.
The resulting file system has been compared with Ext2 [4], Ext3 [27], XFS [22, 24], JFS [8], and ReiserFS [25] through an extensive set of benchmarks, and the experimental results achieved not only show that DualFS is the best file system in general, but also that it can remarkably boost the overall system performance in many cases.
The rest of this paper is organized as follows. Section 2 summarizes the main features of DualFS and introduces the three new enhancements. In Section 3, we describe our benchmarking experiments, whose results are shown in Section 4. We conclude in Section 5.
2 DualFS
In this section we summarize the main features of DualFS, and describe the three enhancements that we have added to the new version of DualFS analyzed in this paper.
2.1 Design and Implementation
The key idea of our new file system is to manage data and meta-data in completely different ways, giving special treatment to meta-data [16]. Meta-data blocks are located on the meta-data device, whereas data blocks are located on the data device (see Figure 1).
Note that merely separating meta-data blocks from data blocks is not a new idea. However, we will show that this separation, along with special meta-data management, can significantly improve performance without requiring extra hardware (previous proposals for separating data and meta-data, such as the multi-structured file system [15] and interposed request routing [1], use several storage devices). In our experiments, the data device and the meta-data device are two adjacent partitions on the same disk. This is equivalent to having a single partition with two areas: one for data blocks and another for meta-data blocks.
Data Device
Data blocks of regular files are on the data device, where they are treated as in many Unix file systems. The data device only has data blocks, and uses the concept of a group of data blocks (similar to the cylinder group concept) in order to organize them (see Figure 1). Grouping is performed on a per-directory basis, i.e., data blocks of regular files in the same directory are in the same group (or in nearby groups if the corresponding group is full).
From the file-system point of view, data blocks are not important for consistency, so they are not written synchronously and do not receive any special treatment, as meta-data does [8, 20, 24, 28]. However, they must be taken into account for security reasons. When a new data block is added to a file, DualFS writes it to disk before writing its related meta-data blocks. Missing out this requirement would not actually damage consistency, but it could potentially lead to a file containing a previous file's contents after crash recovery, which is a security risk.
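The data-before-meta-data ordering can be sketched as follows. This is a minimal illustration with invented names (RecordingDisk, OrderedWriter); it is not DualFS code.

```python
class RecordingDisk:
    """Minimal stand-in for a disk: remembers the write order."""
    def __init__(self):
        self.log = []

    def write(self, block):
        self.log.append(block)

class OrderedWriter:
    """Illustrative writer enforcing data-before-meta-data ordering."""
    def __init__(self, disk):
        self.disk = disk
        self.pending_data = []  # new data blocks not yet on disk

    def append_data_block(self, block):
        self.pending_data.append(block)

    def commit_metadata(self, meta_blocks):
        # Flush new data blocks first: if a crash happens between the
        # two steps, the meta-data still points at old (or no) blocks,
        # so a file can never expose a previous file's contents.
        for block in self.pending_data:
            self.disk.write(block)
        self.pending_data.clear()
        for block in meta_blocks:
            self.disk.write(block)

disk = RecordingDisk()
writer = OrderedWriter(disk)
writer.append_data_block("data-1")
writer.append_data_block("data-2")
writer.commit_metadata(["inode-1"])
# disk.log now holds the data blocks before the meta-data block.
```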
Meta-data Device
Meta-data blocks are on the meta-data device, which is organized as a log-structured file system (see Figure 1). It is important to clarify that by meta-data we mean all of these items: i-nodes, indirect blocks, directory "data" blocks, and symbolic-link "data" blocks. Obviously, bitmaps, superblock copies, etc., are also meta-data elements.
Our implementation of a log-structured file system for meta-data is based on the BSD-LFS one [19]. However, it is important to note that, unlike BSD-LFS, our log does not contain data blocks, only meta-data ones. Another difference is the IFile. Our IFile implementation has two additional elements which manage the data device: the data-block group descriptor table and the data-block group bitmap table. The former basically has a free data-block count and an i-node count per group, while the latter indicates which blocks are free and which are busy in every group.
DualFS writes dirty meta-data blocks to the log every five seconds and, like other journaling systems, uses the log after a system crash to recover file-system consistency quickly from the last checkpoint (checkpointing is performed every sixty seconds). Note, however, that one of the main differences between DualFS and current journaling file systems [5, 24, 28] is the fact that DualFS only has to write one copy of every meta-data block. This copy is written in the log, which is used by DualFS both to read and to write meta-data blocks. In this way, DualFS avoids a time-expensive extra copy of meta-data blocks. Furthermore, since meta-data blocks are written only once and in big chunks in the log, the average write-request size is greater in DualFS than in other file systems where meta-data blocks are spread out. This causes fewer write requests and, hence, greater write performance [16].
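The effect of batching log writes on the request count can be illustrated with simple arithmetic; the 64-block segment size below is an assumption chosen for illustration, not a DualFS parameter.

```python
def write_requests(dirty_blocks, blocks_per_request):
    """Number of write requests needed to flush `dirty_blocks`
    when each request can carry `blocks_per_request` blocks."""
    return -(-dirty_blocks // blocks_per_request)  # ceiling division

# Scattered meta-data writes: roughly one request per block.
spread = write_requests(1000, 1)
# Log-structured writes: many blocks per sequential request
# (e.g. a 64-block chunk; the chunk size is an assumption).
batched = write_requests(1000, 64)
# 1000 requests versus 16: fewer, larger requests for the same data.
```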
Finally, it is important to note that our file system is considered consistent when the information about meta-data is correct. Like other approaches (JFS [8], SoftUpdates [14], XFS [24], and VxFS [28]), DualFS allows some loss of data in the event of a system crash.
Cleaner
Since our meta-data device is a log-structured file system, we need a segment cleaner for it. Our cleaner is started in two cases: (a) every 5 seconds, if the number of clean segments drops below a specific threshold, and (b) when we need a new clean segment and all segments are dirty.
At the moment our attention is not on the cleaner, so we have implemented a simple one, based on Rosenblum's cleaner [18]. Despite its simplicity, this cleaner has a very small impact on DualFS performance [16].
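The two triggering conditions can be sketched as a simple predicate (a minimal sketch; the function and parameter names are illustrative, not DualFS internals):

```python
def should_start_cleaner(clean_segments, threshold, need_new_segment):
    """Decide whether the segment cleaner should run."""
    # (a) periodic check: too few clean segments left.
    if clean_segments < threshold:
        return True
    # (b) on demand: a clean segment is needed but all are dirty.
    if need_new_segment and clean_segments == 0:
        return True
    return False
```

In DualFS, case (a) is evaluated on the 5-second timer, while case (b) fires whenever a new segment is requested.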
2.2 Enhancements
[Figure 1: disk divided into the data device (groups 0 to N-1) and the meta-data device (segments 0 to K-1); writes are appended to the log.]
Figure 1: DualFS overview. Arrows are pointers stored in segments of the meta-data device. These pointers are numbers of blocks in both the data device (data blocks) and the meta-data device (meta-data blocks).

Reading a regular file in DualFS is inefficient because data blocks and their related meta-data blocks are a long way from each other, and many long seeks are required. In this section we present a solution to that problem, meta-data prefetching; a mechanism to improve that prefetching, on-line meta-data relocation; and a third mechanism to improve the overall performance of data operations, directory affinity.
Meta-data Prefetching
A solution to the read problem in DualFS is meta-data prefetching. If we are able to read many meta-data blocks of a file in a single disk I/O operation, then the disk heads will stay in the data zone for a long time. This will eliminate almost all the long seeks between the data device and the meta-data device, and it will produce bigger data read requests.
The meta-data prefetching implemented in DualFS is very simple: when a meta-data block is required, DualFS reads a group of consecutive meta-data blocks from disk, including the required meta-data block. Prefetching is not performed when the requested meta-data block is already in memory. The idea is not to force an unnecessary disk I/O request, but to take advantage of a compulsory disk-head movement to the meta-data zone when a meta-data block must be read, and to prefetch some meta-data blocks then. Since all prefetched meta-data blocks are consecutive, we also take advantage of the built-in cache of the disk drive.
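A minimal sketch of this prefetch policy follows; the window size and the cache/disk interfaces are assumptions made for illustration, not DualFS internals.

```python
PREFETCH_WINDOW = 8   # total consecutive blocks read per miss (assumed)

def read_metadata_block(blkno, cache, disk):
    """Return meta-data block `blkno`, prefetching neighbors on a miss."""
    if blkno in cache:              # already in memory: no prefetch
        return cache[blkno]
    # The disk head had to move to the meta-data zone anyway, so read
    # a run of consecutive blocks: mostly *before* the requested one
    # (the log is written in a known, mostly inverse, order), plus a
    # block just after it for files read slightly out of write order.
    start = max(0, blkno - (PREFETCH_WINDOW - 2))
    for b in range(start, blkno + 2):
        if b not in cache:
            cache[b] = disk.read(b)
    return cache[blkno]

class RecordingDisk:
    """Stand-in disk that records which blocks were read."""
    def __init__(self):
        self.reads = []

    def read(self, blkno):
        self.reads.append(blkno)
        return f"block-{blkno}"

cache, disk = {}, RecordingDisk()
read_metadata_block(10, cache, disk)   # miss: one run of 8 reads
read_metadata_block(8, cache, disk)    # hit: no disk access at all
```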
But, in order to be efficient, our simple prefetching requires some kind of meta-data locality. The DualFS meta-data device is a log where meta-data blocks are sequentially written in chunks called "partial segments". All meta-data blocks in a partial segment have been created or modified at the same time, so some kind of relationship between them is expected. Moreover, many relationships between meta-data blocks are due to those blocks belonging to the same file (e.g., indirect blocks). It is this kind of temporal and spatial meta-data locality, present in the DualFS log, that makes our prefetching highly efficient.

Once we have decided when to prefetch, the next step is to decide which and how many meta-data blocks must be prefetched. This decision depends on the meta-data layout in the DualFS log, and this layout, in turn, depends on the meta-data write order.
In DualFS, meta-data blocks belonging to the same file are written to the log in the following order: "data" blocks (of directories and symbolic links), single indirect blocks, double indirect blocks, and triple indirect blocks. Furthermore, "data" blocks are written in inverse logical order. Files are also written in inverse order: first, the last modified (or created) file, and last, the first modified (or created) file. Note that the important fact is not that the logical order is "inverse"; what is important is that the logical order is "known".
Taking all of the above into account, we can see that the greater part of the prefetched meta-data blocks must be just before the required meta-data block. However, since some files can be read in an order slightly different from the one in which they were written, we must also prefetch some meta-data blocks which are located after the requested meta-data block.
Finally, note that, unlike other prefetching methods [10, 11], DualFS prefetching is I/O-time efficient; that is, it does not cause extra I/O requests which could in turn cause long seeks. Also note that our simple prefetching mechanism works because it takes advantage of the unique features of our meta-data device.
On-line Meta-data Relocation
The meta-data prefetching efficiency may deteriorate for several reasons:

• files are read in an order which is very different from the order in which they were written;

• the read pattern can change over the course of time;

• file-system aging.
Inefficient prefetching increases the number of meta-data read requests and, hence, the number of disk-head movements between the data zone and the meta-data zone. Moreover, it can become counterproductive because it can fill the buffer cache up with useless meta-data blocks. In order to avoid prefetching degradation (and to improve its performance in some cases), we have implemented an on-line meta-data relocation mechanism in DualFS which increases temporal and spatial meta-data locality.
Meta-data relocation works as follows: when a meta-data element (i-node, indirect block, directory block, etc.) is read, it is written to the log like any other just-modified meta-data element. Note that it does not matter whether the meta-data element was already in memory when it was read or whether it was read from disk.
Meta-data relocation is mainly performed in two cases:

• when reading a regular file (its indirect blocks are relocated);

• when reading a directory (its "data" and indirect blocks are relocated).
This constant relocation adds more meta-data writes. However, meta-data writes are very efficient in DualFS because they are performed sequentially and in big chunks. Therefore, it is expected that this relocation, even when very aggressive, does not increase the total I/O time significantly.
Since a file is usually read in its entirety [2], meta-data relocation puts together all the meta-data blocks of the file. The next time the file is read, all its meta-data blocks will be together, and prefetching will be very efficient.
Meta-data relocation also puts together the meta-data blocks of different files read at the same time (i.e., the meta-data blocks of a directory and its regular files). If those files are later read at the same time again, and even in the same order, prefetching will work very well. This assumption about the read order is made by many prefetching techniques [10, 11]. It is important to note that our meta-data relocation exposes the relationships between files by writing together the meta-data blocks of files which are read at the same time. Unlike other prefetching techniques [10, 11], this relationship is permanently recorded on disk, and it can be exploited by our prefetching mechanism after a system restart.
This kind of on-line meta-data relocation is in a sense similar to the work proposed by Matthews et al. [12]. They take advantage of cached data in order to reduce cleaning costs; we write recently read and, hence, cached meta-data blocks to the log in order to improve meta-data locality and prefetching. They also propose a dynamic reorganization of data blocks to improve read performance. However, that reorganization is not an on-line one, and it is more complex than ours.
Finally, we must explain an implicit meta-data relocation which we have not mentioned yet. When a file is read, the access-time field in its i-node must be updated. If several files are read at the same time, their i-nodes will also be updated at the same time and written together in the log. In this way, when an i-node is read later, the other i-nodes in the same block will also be read. This implicit prefetching is very important since it exploits the temporal locality in the log, and can potentially reduce file-open latencies and the overall I/O time. Note that this i-node relocation always takes place, without explicit meta-data relocation, except when the file system is mounted read-only.
This implicit meta-data relocation can have an effect similar to that achieved by the embedded i-nodes proposed by Ganger and Kaashoek [6]. In their proposal, i-nodes of files in the same directory are put together and stored inside the directory itself. Note, however, that they exploit spatial locality, whereas we exploit both spatial and temporal locality.
Directory Affinity
In DualFS, as in Ext2 and Ext3, when a new directory is created it is assigned a data-block group on the data device. Data blocks of the regular files created in that directory are put in the group assigned to the directory. DualFS specifically selects the emptiest data-block group for the new directory.
Note that, when creating a directory tree, DualFS has to select a new group every time a directory is created. This can cause a change of group and, hence, a large seek. However, the new group for the new directory may be only slightly better than that of the parent directory. A better option is to remain in the parent directory's group, and to change only when the new group is good enough according to a specific threshold (for example, we can change to the new group if it has a given percentage more free data blocks than the parent directory's group). The latter is called directory affinity in DualFS, and can greatly improve DualFS performance, even for a threshold as small as 10%.
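The group selection with directory affinity can be sketched as a simple threshold test. The function name and interfaces below are illustrative, with the 10% threshold taken from the text.

```python
def pick_group(parent_group, free_blocks, threshold=0.10):
    """Choose the data-block group for a new directory.

    free_blocks maps group id -> free data-block count.
    Stay in the parent's group unless the emptiest group has at
    least `threshold` more free blocks (avoids needless long seeks).
    """
    emptiest = max(free_blocks, key=free_blocks.get)
    if free_blocks[emptiest] > free_blocks[parent_group] * (1 + threshold):
        return emptiest
    return parent_group

# Only slightly better group: stay with the parent.
same = pick_group(0, {0: 100, 1: 105})
# Much better group: worth moving.
moved = pick_group(0, {0: 100, 1: 120})
```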
Directory affinity in DualFS is somewhat similar to the "directory affinity" concept used by Anderson et al. [1], but with some important differences. For example, their directory affinity is a probability (the probability that a new directory is placed on the same server as its parent), whereas it is a threshold in our case.
3 Experimental Methodology
This section describes how we have evaluated the new version of DualFS. We have used both microbenchmarks and macrobenchmarks with different configurations, using Linux kernel 2.4.19. We have compared DualFS against Ext2 [4], the default file system in Linux, and four journaling file systems: Ext3 [27], XFS [22, 24], JFS [8], and ReiserFS [25].
Ext2 is not a journaling file system, but it has been included because it is a widely-used and well-understood file system. Ext2 is an FFS-like file system [13] which does not write modified meta-data elements synchronously. Instead, it marks a meta-data element to be written within 5 seconds. Furthermore, meta-data elements involved in a file-system operation are modified, and marked to be written, in a specific order (although this order is not compulsory, and depends on the disk scheduler). In this way, Ext2 can almost always recover its consistency after a system crash without significantly hurting performance.
Ext3 is a journaling file system derived from Ext2 which provides different consistency levels through mount options. In our tests, the mount option used has been "-o data=ordered", which gives Ext3 a behavior similar to that of DualFS. With this option, Ext3 only journals meta-data changes, but flushes data updates to disk before any transactions commit.
Based on SGI's IRIX XFS file-system technology, XFS is a journaling file system which supports meta-data journaling. XFS uses allocation groups and extent-based allocation to improve the locality of data on disk. This results in improved performance, particularly for large sequential transfers. Performance features include asynchronous write-ahead logging (similar to Ext3 with data=writeback), balanced binary trees for most file-system meta-data, delayed allocation, and dynamic disk i-node allocation.
IBM's JFS originated on AIX, and from there was ported to Linux. JFS is also a journaling file system which supports meta-data logging. Its technical features include extent-based storage allocation, dynamic disk i-node allocation, asynchronous write-ahead logging, and sparse and dense file support.
ReiserFS is a journaling file system which is especially intended to improve the performance of small files, to use disk space more efficiently, and to speed up operations on directories with thousands of files. Like other journaling file systems, ReiserFS only journals meta-data. In our tests, the mount option used has been “-o notail”, which notably improves ReiserFS performance at the expense of disk-space efficiency.
Bryant et al. [3] compared the above five file systems using several benchmarks on three different systems, ranging in size from a single-user workstation to a 28-processor ccNUMA machine. However, there are some important differences between their work and ours. Firstly, we evaluate a next-generation journaling file system (DualFS). Secondly, we report results for some industry-standard benchmarks (SpecWeb99 and TPC-C). Thirdly, we use microbenchmarks which are able to clearly expose performance differences between file systems (in their single-user workstation system, for example, only one benchmark was able to show performance differences). And finally, we also report I/O time at disk level (this time is very important because it reflects the behavior of every file system as seen by the disk drive).
3.1 Microbenchmarks
Microbenchmarks are intended to discover strengths and weaknesses of every file system. We have designed seven benchmarks:
read-meta (r-m) find files larger than 2 KB in a directory tree.

read-data-meta (r-dm) read all the regular files in a directory tree.

write-meta (w-m) untar an archive which contains a directory tree with empty files.

write-data-meta (w-dm) the same as w-m, but with non-empty files.

read-write-meta (rw-m) copy a directory tree with empty files.

read-write-data-meta (rw-dm) the same as rw-m, but with non-empty files.
delete (del) delete a directory tree.
In all cases, the directory tree has 4 subdirectories, each with a copy of a clean Linux 2.4.19 source tree. In the “write-meta” and “read-write-meta” benchmarks, we have truncated to zero all regular files in the directory tree. In the “write-meta” and “write-data-meta” tests, the untarred archive is in a file system which is not being analyzed.
All tests have been run 5 times for 1 and 4 processes. When there are 4 processes, each works on its own Linux source tree copy.
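The first two microbenchmarks can be driven by small scripts; the sketch below shows one plausible implementation of r-m and r-dm in Python (the function names are ours, not part of the benchmark suite):

```python
import os

def read_meta(root):
    """r-m: walk the tree and list files larger than 2 KB. Only directory
    entries and i-node sizes are consulted (meta-data reads), file
    contents are never opened."""
    large = []
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > 2 * 1024:
                large.append(path)
    return large

def read_data_meta(root):
    """r-dm: read every regular file in the tree (data and meta-data
    reads). Returns the total number of bytes read."""
    total = 0
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as f:
                total += len(f.read())
    return total
```

The remaining benchmarks (untar, copy, delete) follow the same pattern with `tarfile`, `shutil.copytree`, and `shutil.rmtree`.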
3.2 Macrobenchmarks
Next, we list the benchmarks we have performed to study the viability of our proposal. Note that we have chosen environments that are currently representative.
Kernel Compilation for 1 Process (KC-1P) resolve dependencies (make dep) and compile the Linux kernel 2.4.19, given a common configuration. The kernel and modules compilation phases are run by a single process (make bzImage, and make modules).
Kernel Compilation for 4 Processes (KC-4P) the same as before, but with 4 processes (make -j4 bzImage, and make -j4 modules).
SpecWeb99 (SW99) the SPECweb99 benchmark [23]. We have used two machines: a server, with the file system to be analyzed, and a client. The network is a Fast Ethernet LAN.
PostMark (PM) the PostMark benchmark, which was designed by Jeffrey Katcher to model the workload seen by Internet Service Providers under heavy load [9]. We have run our experiments using version 1.5 of the benchmark. In our case, the benchmark initially creates 150,000 files with a size range of 512 bytes to 16 KB, spread across 150 subdirectories. Then, it performs 20,000 transactions with no bias toward any particular transaction type, and with a transaction block of 512 bytes.
TPCC-UVA (TPC-C) an implementation of the TPC-C benchmark [26]. Due to system limitations, we have only used 3 warehouses. The benchmark is run with an initial 30-minute warm-up stage, and a subsequent measurement time of 2 hours.
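To make the PostMark workload concrete, the following is a scaled-down Python sketch of its two phases as described above (this is our own toy reimplementation, not the actual PostMark tool; the defaults shrink the paper's 150,000 files / 150 subdirectories / 20,000 transactions):

```python
import os, random

def postmark_like(root, n_files=150, n_subdirs=5, n_txns=200,
                  min_size=512, max_size=16 * 1024, block=512, seed=42):
    """Phase 1: create files of 512 B..16 KB spread across subdirectories.
    Phase 2: run transactions (create/delete/read/append) chosen with no
    bias toward any type, using a 512-byte transaction block."""
    rng = random.Random(seed)
    files = []
    next_id = 0

    def create():
        nonlocal next_id
        sub = os.path.join(root, "s%02d" % (next_id % n_subdirs))
        os.makedirs(sub, exist_ok=True)
        path = os.path.join(sub, "f%06d" % next_id)  # unique name
        next_id += 1
        with open(path, "wb") as f:
            f.write(b"\0" * rng.randint(min_size, max_size))
        files.append(path)

    for _ in range(n_files):
        create()
    for _ in range(n_txns):
        op = rng.choice(("create", "delete", "read", "append"))
        if op == "create":
            create()
        elif op == "delete" and len(files) > 1:
            os.remove(files.pop(rng.randrange(len(files))))
        elif op == "read":
            with open(rng.choice(files), "rb") as f:
                f.read(block)
        elif op == "append":
            with open(rng.choice(files), "ab") as f:
                f.write(b"\0" * block)
    return len(files)
```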
3.3 Tested Configurations
All benchmarks have been run for the six configurations shown below. Mount options have been selected following the recommendations of Bryant et al. [3]:
Ext2 Ext2 without any special mount option.
Ext3 Ext3 with “-o data=ordered” mount option.
XFS XFS with “-o logbufs=8,osyncisdsync” mount options.
JFS JFS without any special mount option.
ReiserFS ReiserFS with “-o notail” mount option.
Linux Platform
  Processor  Two 450 MHz Pentium III
  Memory     256 MB, PC100 SDRAM
  Disks      One 4 GB IDE 5,400 RPM Seagate ST-34310A (test disk).
             One 4 GB SCSI 10,000 RPM Fujitsu MAC3045SC
             (operating system, swap, and trace log).
  OS         Linux 2.4.19

Table 1: System Under Test
DualFS DualFS with meta-data prefetching, on-line meta-data relocation, and directory affinity.
Versions of Ext2, Ext3, and ReiserFS are those found in a standard Linux kernel 2.4.19. The XFS version is 1.1, and the JFS version is 1.1.1.
All but DualFS are on one IDE disk with only one partition, and their logical block size is 4 KB. DualFS is also on one IDE disk, but with two adjacent partitions. The meta-data device is always on the outer partition, given that this partition is faster than the inner one [30], and its size is 10% of the total disk space. The inner partition is the data device. Both the data and the meta-data devices also have a logical block size of 4 KB.
Finally, DualFS has a prefetching size of 16 blocks, and a directory affinity of 10% (that is, a newly created directory is assigned the best group, instead of its parent directory’s group, if the best group has at least 10% more free data blocks). Since the logical block size is 4 KB, the prefetching size is equal to 64 KB, precisely the biggest disk I/O transfer size used by default by Linux.
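The directory-affinity rule above can be stated as a small decision function; the sketch below is ours (the group-bookkeeping names are invented, not DualFS internals):

```python
def choose_group(parent_group, best_group, free_data_blocks, affinity=0.10):
    """Place a newly created directory in the best group, instead of its
    parent directory's group, only if the best group has at least 10%
    more free data blocks. `free_data_blocks` maps group id -> free
    data blocks (hypothetical bookkeeping structure)."""
    if free_data_blocks[best_group] >= (1 + affinity) * free_data_blocks[parent_group]:
        return best_group
    return parent_group

# Prefetching size: 16 blocks of 4 KB each, i.e. 64 KB, Linux's biggest
# default disk I/O transfer size.
PREFETCH_BYTES = 16 * 4 * 1024
assert PREFETCH_BYTES == 64 * 1024
```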
3.4 System Under Test
All tests are done on the same machine. The configuration is shown in Table 1.
In order to trace disk I/O activity, we have instrumented the operating system (Linux 2.4.19) to record when a request starts and finishes. The messages generated by our trace system are logged on an SCSI disk which is not used for evaluation purposes.
Messages are printed using the kernel function printk. This function writes messages into a circular buffer in main memory, so the delay introduced by our trace system is small (around 1%), especially when compared to the time required to perform a disk access. Later, these messages are read through the /proc/kmsg interface, and then written to the SCSI disk in big chunks. In order to avoid the loss of messages (the last messages can overwrite the older ones), we have increased the circular buffer size from 16 KB to 1 MB, and we have given maximum priority to all processes involved in the trace system.
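The overwrite behavior of a circular message buffer, and why enlarging it prevents message loss, can be modeled in a few lines (a toy model of the mechanism, not the kernel's printk implementation):

```python
from collections import deque

class RingLog:
    """Toy circular message buffer: when the buffer is full, each new
    message overwrites the oldest one. This is exactly the message loss
    that a larger buffer (16 KB -> 1 MB in the text) avoids."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
        self.dropped = 0

    def log(self, msg):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1  # the oldest message is about to be lost
        self.buf.append(msg)
```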
4 Experimental Results
In order to better understand the different file systems, we show two performance results for each file system in each benchmark:
Disk I/O time: the total time taken by all disk I/O operations.

Performance: the performance achieved by the file system in the benchmark.
The performance result is the application time, except for the SpecWeb99 and TPC-C benchmarks. Since these macrobenchmarks take a fixed time specified by the benchmark itself, we will use the benchmark metrics (number of simultaneous connections for SpecWeb99, and tpmC for TPC-C) as throughput measurements.
We could give the application time only, but given that some macrobenchmarks (e.g., compilation) are CPU-bound in our system, we find I/O time more useful in those cases. The total I/O time can give us an idea of to what extent the storage system can be loaded. A file system that loads the disk less than other file systems makes it possible to increase the number of applications which perform disk operations concurrently.
Figures below show the confidence intervals for the mean as error bars, for a 95% confidence level calculated using a Student’s-t statistic. The number inside each bar is the height of the bar, which is a value normalized with respect to Ext2. For Ext2 the height is always 1, so we have written the absolute value for Ext2 inside each Ext2 bar, which is useful for comparison purposes between figures.
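The error bars can be reproduced with a standard Student's-t interval. For 5 runs (4 degrees of freedom), the two-sided 95% critical value is about 2.776; the sketch below hardcodes a small t-table rather than pulling in SciPy:

```python
import math

# Two-sided 95% Student's-t critical values, indexed by degrees of freedom.
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def confidence_interval(samples, t_table=T95):
    """Return (mean, half_width) of a 95% confidence interval:
    half_width = t * s / sqrt(n), with s the sample standard deviation."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    half = t_table[n - 1] * math.sqrt(var / n)
    return mean, half
```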
The DualFS version evaluated in this section includes the three enhancements described previously. The reader can refer to our previous work [17] for the impact of each enhancement (except directory affinity) on DualFS performance.
Finally, it is important to note that all benchmarks have been run with a cold file system cache (the computer is restarted after every test run).
4.1 Microbenchmarks
Figures 2 and 3 show the microbenchmark results. A quick review shows that DualFS has the lowest disk I/O times and application times in general, in both write and read operations, and that JFS has the worst performance. Only ReiserFS is clearly better than DualFS in the write-data-meta benchmark, and it is even better when there are four processes. However, ReiserFS performance is very poor in the read-data-meta case. This is a serious problem for ReiserFS given that reads are a more frequent operation than writes [2, 29]. In order to understand these results, we must explain some ReiserFS features.
ReiserFS does not use the block group or cylinder group concepts like the other file systems (Ext2, Ext3, XFS, JFS, and DualFS). Instead, ReiserFS uses an allocation algorithm which allocates blocks almost sequentially when the file system is empty. This allocation starts at the beginning of the file system, after the last block of the meta-data log. Since many blocks are allocated sequentially, they are also written sequentially. The other file systems, however, have data blocks spread across different groups which take up the entire file system, so writes are not as efficient as in ReiserFS. This explains the good performance of ReiserFS in the write-data-meta test.
The above also explains why ReiserFS is not so bad in the read-data-meta test when there are four processes. Since the directory tree read by the processes is small, and it is created in an empty file system, all its blocks are together at the beginning of the file system. This makes the processes’ read requests cause only small seeks when the processes read the directory tree. The same does not occur in the other file systems, where the four processes cause long seeks because they read blocks which are in different groups spread across the entire disk.
Another interesting point in the write-data-meta benchmark is the performance of ReiserFS and DualFS with respect to Ext2. Although the disk I/O times of both file systems are much better than that of Ext2, the application times are not as good. This indicates that Ext2 achieves a better overlap between I/O and CPU operations because it neither has a journal nor forces a meta-data write order.
The good performance of DualFS in the read-data-meta benchmark is especially remarkable: DualFS is up to 35% faster than ReiserFS. When compared with previous versions of DualFS [16], we can see that this performance is mainly provided by the meta-data prefetching and directory affinity mechanisms, which respectively reduce the number of long seeks between the data and meta-data partitions, and between data-block groups within the data partition.
In all the metadata-only benchmarks but write-meta, the distinguished winners (regarding the application time) are ReiserFS and DualFS. They are also the absolute winners when taking the disk I/O time into account. Also note that DualFS is the best when the number of processes is four.
[Figure 2: Microbenchmark results (application time). Two panels (1 process and 4 processes) plot the application time of Ext2, Ext3, XFS, JFS, ReiserFS, and DualFS for the w-dm, r-dm, rw-dm, w-m, r-m, rw-m, and del benchmarks, normalized with respect to Ext2; the absolute Ext2 times are printed inside the Ext2 bars.]
The problem with the write-meta benchmark is that it is CPU-bound, because all modified meta-data blocks fit in main memory. In this benchmark, Ext2 is very efficient whereas the journaling file systems are, in general, big consumers of CPU time. Even then, DualFS is the best of the five. Regarding the disk I/O time, ReiserFS and DualFS are the best, as expected, because they write meta-data blocks to disk sequentially.
[Figure 3: Microbenchmark results (disk I/O time). Two panels (1 process and 4 processes) plot the disk I/O time of the six file systems for the same seven benchmarks, normalized with respect to Ext2; the absolute Ext2 times are printed inside the Ext2 bars.]
In the read-meta case, ReiserFS and DualFS have the best performance because they read meta-data blocks which are very close together. Ext2, Ext3, XFS, and JFS, however, have meta-data blocks spread across the storage device, which causes long disk-head movements. Note that the DualFS performance is even better when there are four processes.
This increase in DualFS performance, when the number of processes goes from one to four, is due to the meta-data prefetching. Indeed, prefetching makes DualFS scalable with the number of processes. Table 2 shows the I/O time for the six file systems studied, for one and four processes. Whereas Ext2, Ext3, XFS, JFS, and ReiserFS significantly increase their total I/O time when the number of processes goes from one to four, DualFS increases its I/O time only slightly.

                        I/O Time (seconds)
File System    1 process        4 processes      Increase (%)
Ext2           47.86 (0.35)     56.99 (1.13)      19.08
Ext3           48.45 (0.29)     57.59 (0.35)      18.86
XFS            24.00 (0.66)     35.13 (1.15)      46.38
JFS            45.49 (0.46)     98.10 (0.44)     115.65
ReiserFS        7.02 (1.44)      9.77 (2.32)      39.17
DualFS          5.46 (2.00)      5.90 (3.33)       8.06

Table 2: Results of the r-m test. The value in parenthesis is the confidence interval given as a percentage of the mean.
For one process, the high meta-data locality in the DualFS log and the implicit prefetching performed by the disk drive (through the read-ahead mechanism) make the difference between DualFS, and Ext2, Ext3, JFS, and XFS. ReiserFS also takes advantage of the same disk read-ahead mechanism. However, that implicit prefetching performed by the disk drive is less effective if the number of processes is two or greater. When there are four processes, the disk heads constantly go from track to track because each process works with a different area of the meta-data device. When the disk drive reads a new track, the previously read track is evicted from the built-in disk cache, and its meta-data blocks are discarded before being requested by the process which caused the read of the track.
The explicit prefetching performed by DualFS solves the above problem by copying meta-data blocks from the built-in disk cache to the buffer cache in main memory before they are evicted. Meta-data blocks can stay in the buffer cache for a long time, whereas meta-data blocks in the built-in cache will soon be evicted, when the disk drive reads another track.
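The effect described in the last two paragraphs can be illustrated with a toy model (all parameters invented): a built-in disk cache that holds only the most recently read track, four readers interleaving requests over disjoint regions, with and without explicit multi-block prefetching into a large main-memory buffer cache:

```python
def track_reads(n_procs=4, blocks_per_proc=32, track=8, prefetch=False):
    """Count track reads when n_procs sequential readers interleave one
    request at a time. The built-in cache holds only the last track read;
    with `prefetch`, every block of a freshly read track is copied into
    the (large) buffer cache before it can be evicted."""
    cached_track = None   # built-in disk cache: most recent track only
    buffer_cache = set()  # main-memory buffer cache
    reads = 0
    for step in range(blocks_per_proc):
        for p in range(n_procs):
            block = p * 1000 + step  # each process reads its own region
            if block in buffer_cache:
                continue             # explicit prefetch already saved it
            t = block // track       # track holding this block
            if t != cached_track:
                cached_track = t     # the drive reads the whole track
                reads += 1
                if prefetch:
                    buffer_cache.update(range(t * track, (t + 1) * track))
    return reads
```

With one process, the drive's implicit read-ahead suffices (4 track reads for 32 blocks); with four interleaved processes, every request evicts the track the next process needs (128 track reads), unless explicit prefetching copies each track into the buffer cache first (16 track reads).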
Another remarkable point in the read-meta benchmark is the XFS performance. Although XFS has meta-data blocks spread across the storage device like Ext2 and Ext3, its performance is much better. We have analyzed the XFS disk I/O traces and found out that XFS does not update the “atime” of directories by default (we have not specified any “noatime” or “nodiratime” mount options for any file system). The lack of meta-data writes in XFS reduces the total I/O time because there are fewer disk operations, and because the average time of the read requests is smaller. JFS does not update the “atime” of directories either, but that does not significantly reduce its I/O time.
In the last benchmark, del, the XFS behavior is odd again. For one process, it has very bad performance. However, the performance improves when there are four processes. The other file systems behave as we expect.
Finally, note the great performance of DualFS in the read-data-meta and read-meta benchmarks despite the on-line meta-data relocation.
4.2 Macrobenchmarks
Results of the macrobenchmarks are shown in Figure 4. Since the benchmark metrics are different, we show the application performance of every file system relative to Ext2.
As we can see, the only I/O-bound benchmark is PostMark (that is, benchmark results agree with the I/O time results). The other four benchmarks are CPU-bound in our system, and all file systems consequently have similar performance. Nevertheless, DualFS is usually a little better than the other file systems. The reason to include these CPU-bound benchmarks is that they are very common in system evaluations.
From the disk I/O time point of view, DualFS has the best throughput. Only XFS is better than DualFS in the SpecWeb99 and TPC-C benchmarks. However, we must take into account that XFS does not update the access time of directory i-nodes, and that it does not meet the TPC-C benchmark requirements (specifically, it does not meet the response time constraints for the new-order and order-status transactions).
In order to explain this superiority of DualFS over the other file systems, we have analyzed the disk I/O traces obtained, and we have found out that the performance differences between file systems are mainly due to writes. There are a lot of write operations in these tests, and DualFS is the file system which carries them out best. For JFS, however, these performance differences are also due to reads, which take longer than in the other file systems.
Internet Service Providers should pay special attention to the results achieved by DualFS in the PostMark benchmark. In this test, DualFS achieves 60% more transactions per second than Ext2 and Ext3, twice as many transactions per second as ReiserFS, almost three times as many transactions per second as XFS, and four times as many transactions per second as JFS. Although there are specific file systems for Internet workloads [21], note that these results are achieved by a general-purpose file system, DualFS, which is intended for a variety of workloads.
[Figure 4: Macrobenchmark results. The first figure shows the application throughput improvement achieved by each file system with respect to Ext2 for the KC-1P, KC-4P, PM, SW99, and TPC-C benchmarks; the second figure shows the disk I/O time normalized with respect to Ext2, with the absolute Ext2 times printed inside the Ext2 bars. In the TPC-C benchmark, the Ext3 and XFS bars are striped because they do not meet the TPC-C benchmark requirements.]
5 Conclusions
Improving file system performance is important for a wide variety of systems, from general-purpose systems to more specialized high-performance systems. Many high-performance systems, for example, rely on off-the-shelf file systems to store final and partial results, and to resume failed computations.
In this paper we have introduced a new version of DualFS which proves, for the first time, that a new journaling file system design based on data and meta-data separation, and on special meta-data management, is not only possible but also desirable.
Through an extensive set of micro- and macrobenchmarks, we have evaluated six different journaling and non-journaling file systems (Ext2, Ext3, XFS, JFS, ReiserFS, and DualFS). The experimental results obtained have shown not only that DualFS has the lowest I/O time in general, but also that it can significantly boost overall file system performance for many workloads (up to 98%).
We feel, however, that the new design of DualFS allows further improvements which can increase DualFS performance even more. Currently, we are working on extracting read access pattern information from meta-data blocks written by the relocation mechanism in order to prefetch entire regular files.
Availability
For more information about DualFS, please visit the Web site at: http://www.ditec.um.es/~piernas/dualfs.
References

[1] ANDERSON, D. C., CHASE, J. S., AND VAHDAT, A. M. Interposed request routing for scalable network storage. In Proc. of the Fourth USENIX Symposium on Operating Systems Design and Implementation (OSDI) (Oct. 2000), 259–272.

[2] BAKER, M. G., HARTMAN, J. H., KUPFER, M. D., SHIRRIFF, K. W., AND OUSTERHOUT, J. K. Measurements of a distributed file system. In Proc. of the 13th ACM Symposium on Operating Systems Principles (1991), 198–212.

[3] BRYANT, R., FORESTER, R., AND HAWKES, J. Filesystem performance and scalability in Linux 2.4.17. In Proc. of the FREENIX Track: 2002 USENIX Annual Technical Conference (June 2002), 259–274.

[4] CARD, R., TS’O, T., AND TWEEDIE, S. Design and implementation of the second extended filesystem. In Proc. of the First Dutch International Symposium on Linux (Dec. 1994).

[5] CHUTANI, S., ANDERSON, O. T., KAZAR, M. L., LEVERETT, B. W., MASON, W. A., AND SIDEBOTHAM, R. The Episode file system. In Proc. of the Winter 1992 USENIX Conference: San Francisco, California, USA (Jan. 1992), 43–60.

[6] GANGER, G. R., AND KAASHOEK, M. F. Embedded inodes and explicit grouping: Exploiting disk bandwidth for small files. In Proc. of the USENIX Annual Technical Conference, Anaheim, California, USA (Jan. 1997), 1–17.

[7] HAGMANN, R. Reimplementing the Cedar file system using logging and group commit. In Proc. of the 11th ACM Symposium on Operating Systems Principles (Nov. 1987), 155–162.

[8] JFS for Linux. http://oss.software.ibm.com/jfs, 2003.

[9] KATCHER, J. PostMark: A new file system benchmark. Technical Report TR3022, Network Appliance Inc. (Oct. 1997).

[10] KROEGER, T. M., AND LONG, D. D. E. Design and implementation of a predictive file prefetching algorithm. In Proc. of the 2001 USENIX Annual Technical Conference: Boston, Massachusetts, USA (June 2001), 319–328.

[11] LEI, H., AND DUCHAMP, D. An analytical approach to file prefetching. In Proc. of the 1997 USENIX Annual Technical Conference (1997), 275–288.

[12] MATTHEWS, J. N., ROSELLI, D., COSTELLO, A. M., WANG, R. Y., AND ANDERSON, T. E. Improving the performance of log-structured file systems with adaptive methods. In Proc. of the ACM SOSP Conference (Oct. 1997), 238–251.

[13] MCKUSICK, M., JOY, W., LEFFLER, S., AND FABRY, R. A fast file system for UNIX. ACM Transactions on Computer Systems 2, 3 (Aug. 1984), 181–197.

[14] MCKUSICK, M. K., AND GANGER, G. R. Soft updates: A technique for eliminating most synchronous writes in the fast filesystem. In Proc. of the 1999 USENIX Annual Technical Conference: Monterey, California, USA (June 1999), 1–17.

[15] MULLER, K., AND PASQUALE, J. A high performance multi-structured file system design. In Proc. of the 13th ACM Symposium on Operating Systems Principles (Oct. 1991), 56–67.

[16] PIERNAS, J., CORTES, T., AND GARCIA, J. M. DualFS: A new journaling file system without meta-data duplication. In Proc. of the 16th Annual ACM International Conference on Supercomputing (June 2002), 137–146.

[17] PIERNAS, J., CORTES, T., AND GARCIA, J. M. Improving the performance of new-generation journaling file systems through meta-data prefetching and on-line relocation. Technical Report UM-DITEC-2003-5 (Sept. 2002).

[18] ROSENBLUM, M., AND OUSTERHOUT, J. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10, 1 (Feb. 1992), 26–52.

[19] SELTZER, M., BOSTIC, K., MCKUSICK, M. K., AND STAELIN, C. An implementation of a log-structured file system for UNIX. In Proc. of the Winter 1993 USENIX Conference: San Diego, California, USA (Jan. 1993), 307–326.

[20] SELTZER, M. I., GANGER, G. R., MCKUSICK, M. K., SMITH, K. A., SOULES, C. A. N., AND STEIN, C. A. Journaling versus soft updates: Asynchronous meta-data protection in file systems. In Proc. of the 2000 USENIX Annual Technical Conference: San Diego, California, USA (June 2000).

[21] SHRIVER, E., GABBER, E., HUANG, L., AND STEIN, C. Storage management for web proxies. In Proc. of the 2001 USENIX Annual Technical Conference, Boston, MA (June 2001), 203–216.

[22] Linux XFS. http://linux-xfs.sgi.com/projects/xfs, 2003.

[23] SPECweb99 Benchmark. http://www.spec.org, 1999.

[24] SWEENEY, A., DOUCETTE, D., HU, W., ANDERSON, C., NISHIMOTO, M., AND PECK, G. Scalability in the XFS file system. In Proc. of the USENIX 1996 Annual Technical Conference: San Diego, California, USA (Jan. 1996), 1–14.

[25] ReiserFS. http://www.namesys.com, 2003.

[26] TPCC-UVA. http://www.infor.uva.es/~diego/tpcc.html, 2003.

[27] TWEEDIE, S. Journaling the Linux ext2fs filesystem. In LinuxExpo ’98 (1998).

[28] Veritas Software. The VERITAS File System (VxFS). http://www.veritas.com/products, 1995.

[29] VOGELS, W. File system usage in Windows NT 4.0. ACM SIGOPS Operating Systems Review 34, 5 (Dec. 1999), 93–109.

[30] WANG, J., AND HU, Y. PROFS: Performance-oriented data reorganization for log-structured file systems on multi-zone disks. In Proc. of the Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Cincinnati, OH, USA (Aug. 2001), 285–293.
Super-Fast Active Memory Operations-Based Synchronization
Lixin Zhang, Zhen Fang, John B. Carter, and Michael A. Parker
School of Computing
University of Utah
Salt Lake City, UT 84112
lizhang, zfang, retrac, [email protected]
Abstract
Synchronization is a crucial operation in scalable multiprocessors and a fundamental element in parallel programming. Various synchronization mechanisms have been proposed over the years and many of them have enjoyed great success. However, as network latency rapidly approaches thousands of processor cycles and multiprocessor systems become larger and larger, conventional synchronization techniques are failing to keep up with the increasing demand for fast, efficient synchronization.
In this paper, we propose a mechanism that allows atomic operations on synchronization variables to be executed on the home memory controller, and lets the home memory controller send fine-grained updates to waiting processors when a synchronization variable reaches certain values. Performing atomic operations near where the data resides, rather than moving the data across the network, operating on it, and moving it back, eliminates significant network traffic, hides high remote memory latencies, and introduces opportunities for additional parallelism. Using a fine-grained update protocol, rather than a write-invalidate protocol, on synchronization variables simplifies the programming model, eliminates invalidation requests, and reduces network traffic.
As a first demonstration, we have used a memory-controller-based atomic operation to optimize the barrier function of an OpenMP library. Our simulation results show that the optimized version outperforms a conventional LL/SC (load-linked/store-conditional) version by a factor of 20.8, a conventional processor-based atomic instruction version by a factor of 15.5, and an active messages version by a factor of 13.4.
Keywords: distributed shared-memory, synchronization, barrier, intelligent memory controller, coherence protocol
1 Introduction
Processor speeds are increasing while wall-time network latency is decreasing very little due to speed-of-light effects. The round-trip network latency of large-scale distributed shared-memory (DSM) machines will soon be thousands of processor cycles. As a result, synchronization overhead has become one of the major obstacles to the scalability of DSM systems and has limited their sustained performance. For instance, a barrier, one of the most frequently used synchronization operations, coordinates concurrent processes to guarantee that no process advances beyond a particular point until all processes have reached that point. Performing a barrier operation on an Origin 3000 system with 32 MIPS R14K processors requires about 90,000 cycles, during which time the system could execute 5.76 million floating-point operations. The 5.76 MFLOPs/barrier ratio is an alarming sign that conventional barrier implementations are hurting overall system performance.
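The 5.76 million figure is consistent with each of the 32 processors issuing two floating-point operations per cycle for the 90,000 cycles the barrier takes (the two-per-cycle peak rate of the R14K is our assumption; the text states only the product):

```python
processors = 32
barrier_cycles = 90_000
flops_per_cycle = 2  # assumed peak FP issue rate of a MIPS R14K core

lost_flops = processors * barrier_cycles * flops_per_cycle
assert lost_flops == 5_760_000  # the 5.76 million operations cited above
```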
A number of barrier solutions have been proposed over the decades. The fastest of all approaches is to use a pair of dedicated wires between every two nodes [2, 11, 22, 23]. However, such approaches are only possible for small-scale systems, where the number of dedicated wires is small. For medium-scale or large-scale systems, the cost of having dedicated wires between every two nodes is prohibitive. For example, a 128-node system would require 16,256 wires. In addition to the high cost of physical wires, hardware-wired approaches cannot synchronize more than one barrier at a time and do not interact well with load-balancing techniques, such as process migration, where the process-to-processor mapping is not static.
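The wire count follows directly from one pair of dedicated wires per node pair:

```python
n = 128
node_pairs = n * (n - 1) // 2  # 8,128 distinct node pairs
wires = 2 * node_pairs         # a pair of dedicated wires for each

assert wires == 16_256  # matches the figure cited above
```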
Most barrier solutions are software-based with limited hardware support. Some processors support atomic instructions that perform variants of read-modify-write operations, e.g., the semaphore instructions of IA-64 [7]. Performing atomic semaphore instructions in the processor core complicates the design of the processor pipeline, load/store units, and system bus. As a simpler alternative, most modern processors, including Alpha [1], PowerPC [13], and MIPS [8], rely on load-linked/store-conditional (LL/SC) instructions to implement atomic read-modify-write operations. An LL instruction loads a block of data into the cache. A subsequent SC instruction attempts to write to the same block. It succeeds only if the block has not been evicted from the local cache since the preceding LL, which implies that the read-write pair effectively was atomic. Any intervention request from another processor between the LL and SC instructions causes the SC to fail. To implement atomic synchronization, library routines typically retry the LL/SC pair repeatedly until the SC succeeds.
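The retry idiom described above can be sketched against a toy memory model (the `ToyMemory` class and its reservation flag are our invention for illustration; real LL/SC is a hardware primitive operating on cache lines):

```python
class ToyMemory:
    """Toy model of one memory word with an LL/SC reservation."""
    def __init__(self, value=0):
        self.value = value
        self.reserved = False

    def load_linked(self):
        self.reserved = True  # start watching the word
        return self.value

    def store_conditional(self, new):
        if not self.reserved:  # reservation lost: the SC fails
            return False
        self.value = new
        self.reserved = False
        return True

    def intervention(self):
        """Another processor touched the block: clear the reservation."""
        self.reserved = False

def atomic_increment(mem):
    """Retry the LL/SC pair until the SC succeeds, as library routines do."""
    while True:
        old = mem.load_linked()
        if mem.store_conditional(old + 1):
            return old
```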
The main drawback of atomic and LL/SC instructions is that they ship data back and forth across the network and system bus for every atomic operation. For instance, in the predominant scalable DSM architecture, directory-based CC-NUMA (cache-coherent non-uniform memory access), each block of memory is associated with a fixed home node, which maintains a directory structure to track the state of all locally-homed data. When a processor is about to modify a synchronization variable that has been modified by another processor, the local DSM hardware sends a message to the barrier’s home node to request a copy and acquire exclusive ownership. Depending on the dirty copy owner’s location, the home node may need to send messages to additional nodes to service the request. During a barrier operation, the ownership of the synchronization variable is passed from one processor to another across the network until all participating processors have atomically incremented the barrier variable.
In an attempt to speed up synchronization operations, we propose adding an Active Memory Unit (AMU) capable of performing simple atomic operations to the memory controller (MC). The presence of AMUs will let processors operate on remote data at the data’s home node, rather than loading it into a local cache, operating on it, and potentially flushing it back to the home node. We call AMU-supported operations Active Memory Operations (AMOs), because they make the conventional “passive” memory controller more “active”. The goal of AMOs is to reduce network traffic, avoid the high latency of remote memory accesses, and ultimately increase the performance of parallel applications.
An existing means of avoiding remote communication is Active Messages [27] (ActMsg). An active
message includes the address of a user-level handler to be executed by the home node processor upon
message arrival using the message body as argument. While active messages are very effective in reducing
communication traffic, using the node’s primary processor to execute the handlers has higher latency than
dedicated hardware and interferes with useful work. With an AMU, similar functionality can be performed in
the MC, which avoids these problems. Although active messages were initially developed for the message-
passing environment to better overlap communication and computation, we find that they are also effective
in improving the performance of barriers. Thus, we will compare AMO-based barriers to active-message-
based barriers in our study.
In general, a barrier synchronization of N processes requires N serial atomic operations. This serialization
exposes the long round trip latency of the network. To enable parallelism during synchronization, some
researchers have proposed barrier trees [5, 21, 28] , which use multiple barrier variables organized as a
tree for one barrier operation. For example, in Yew, Tzeng and Lawrie’s software combining tree [28],
the processors are leaves of the tree and are organized into groups. The last processor to arrive in a group
increments a counter in the group’s parent node. Continuing in this fashion, the last processor reaching the
barrier point works its way to the root of the tree and triggers a reverse wave of wakeup operations to all
processors. Barrier trees achieve significant performance gains on large-scale systems, but they introduce
extra programming complexity. We will later show that AMOs can achieve better performance without
incurring this programming complexity.
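The software combining tree described above can be sketched as follows. The fixed two-level shape, eight-group limit, and single generation counter are our simplifications for illustration, not the implementation from [28]:

```c
#include <stdatomic.h>

/* Sketch of a two-level combining-tree barrier in the spirit of a software
 * combining tree: processors are grouped; the last arriver in each group
 * increments the parent (root) counter, and the last group at the root
 * triggers the wakeup wave. */
struct tree_barrier {
    _Atomic int group_count[8];   /* per-group arrival counters (leaves) */
    _Atomic int root_count;       /* arrival counter at the root */
    _Atomic int generation;       /* bumped by the last arriver to wake everyone */
    int group_size, num_groups;
};

void tree_barrier_wait(struct tree_barrier *b, int my_group) {
    int gen = atomic_load(&b->generation);
    /* the last arriver in a group propagates the arrival to the root */
    if (atomic_fetch_add(&b->group_count[my_group], 1) + 1 == b->group_size) {
        atomic_store(&b->group_count[my_group], 0);       /* reset for reuse */
        /* the last group to arrive at the root releases all processors */
        if (atomic_fetch_add(&b->root_count, 1) + 1 == b->num_groups) {
            atomic_store(&b->root_count, 0);
            atomic_fetch_add(&b->generation, 1);          /* release spinners */
        }
    }
    while (atomic_load(&b->generation) == gen)
        ;  /* spin until the wakeup wave arrives */
}
```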
In the rest of the paper, we present the architecture and software interface for the proposed design and
compare AMOs against LL/SC instructions, processor-side atomic instructions, active messages, and tree-
based implementations of these existing mechanisms. We test the different implementations of the barrier
function of an OpenMP library on an execution-driven simulator. The simulator results show that AMOs
outperform LL/SC instructions, atomic instructions, and active messages by up to a factor of 80, with
geometric means of 20.8, 15.5, and 13.4, respectively, and their tree-based versions by factors of 7.8, 7.1
and 6.9, respectively.
The remainder of this paper is organized as follows. Section 2 describes the active memory architectures
and programming models. Section 3 describes the simulation environment and gives the simulator results.
Section 4 surveys related work. Section 5 summarizes our conclusions and discusses future work.
2 AMU-Supported Synchronization
Our proposed mechanism adds an AMU that can perform simple atomic arithmetic operations to the MC,
extends the coherence protocol to support fine-grained updates, and augments the ISA with a few special
AMO instructions that initiate active memory operations. The initial design of the AMU supports three
AMO instructions: amo.inc (increment by one), amo.fetchadd (fetch and add), and amo.compswap
(compare and swap). Semantically, these instructions are just like the atomic instructions implemented
by some microprocessors. Programmers can use them as if they were traditional processor-side atomic
operations.
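The semantics of the three AMOs can be illustrated with plain C functions (the function names are ours; in hardware, each operation executes atomically inside the AMU at the variable's home memory controller):

```c
/* Illustrative semantics of the three initial AMO instructions. Each runs
 * atomically at the home node's AMU; these plain functions only show the
 * value transformation, not the coherence machinery. */
unsigned amo_fetchadd(unsigned *addr, unsigned delta) {
    unsigned old = *addr;       /* the old value is returned to the requester */
    *addr = old + delta;
    return old;
}

unsigned amo_inc_simple(unsigned *addr) {   /* amo.inc, ignoring the "test"
                                               value described later */
    return amo_fetchadd(addr, 1);
}

unsigned amo_compswap(unsigned *addr, unsigned expect, unsigned newval) {
    unsigned old = *addr;
    if (old == expect)          /* write only when the current value matches */
        *addr = newval;
    return old;
}
```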
In the rest of this section, Section 2.1 presents the hardware organization of the AMU. Section 2.2 de-
scribes the fine-grained update protocol. Section 2.3 describes the programming model of AMOs.
Figure 1 depicts the architecture that we assume. A crossbar connects processors to the network back-
plane, from which they can access remote processors, as well as their local memory and IO subsystems,
shown in Figure 1 (a). In our model, the processors, crossbar, and memory controller all reside on the same
die, as will be typical in near-future system designs. Figure 1 (b) is the block diagram of the Active Memory
Controller with the proposed AMU delimited within the dotted box.
When a processor issues an AMO instruction, it examines the target address and sends a request to the
AMU of the target address’ home node. When the message arrives at the AMU of that node, it is placed in
a queue waiting for dispatch. The control logic of the AMU exports a READY signal to the queue when it
is ready to accept another request. The operands are then read and fed to the memory function unit (MFU
in Figure 1 (b)). The old or new value may be returned to the requesting processor, depending on the AMO.
Memory references for synchronization variables exhibit high temporal locality because every participat-
ing process accesses the same synchronization variable. To further improve the performance of AMOs, we
add a very small cache to the AMU. This cache effectively coalesces operations to synchronization variables,
[Figure 1 (a) shows a node: two CPU cores with caches connected by system busses to an on-die crossbar, the (active) memory controller, and a DRAM module. Figure 1 (b) shows the active memory controller: the DRAM controller and directory controller on the normal data path to/from the crossbar, with the AMU (issue queue, MFU, and AMU cache) alongside.]

Figure 1. Architecture of Active Memory Unit
eliminating the need to load from and write to the off-chip DRAM module every time. Each AMO that hits
in the AMU cache takes only two cycles to complete, independent of the number of nodes contending for
the synchronization variable. The size of the AMU cache is an open design issue. For this study, we assume
a four-entry AMU cache.
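The lookup in such a small cache can be sketched as below; the fully-associative organization and field layout are our assumptions for illustration, not a specified design:

```c
#include <stdbool.h>

/* Sketch of the small fully-associative AMU cache assumed in this study:
 * four entries, each holding one synchronization word. */
struct amu_cache {
    unsigned long tag[4];     /* word address of the cached variable */
    unsigned      value[4];   /* current value of the variable */
    bool          valid[4];
};

/* Returns the entry index on a hit (the two-cycle fast path), or -1 on a
 * miss (the AMU must then fetch the word from DRAM or a processor cache). */
int amu_lookup(const struct amu_cache *c, unsigned long addr) {
    for (int i = 0; i < 4; i++)
        if (c->valid[i] && c->tag[i] == addr)
            return i;
    return -1;
}
```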
AMO instructions operate on coherent data. AMU-generated requests are sent to the directory controller
as fine-grained “get” (for reads) or “put” (for writes) requests. A fine-grained “get” loads the coherent
value of a word or a double-word. In our model, the directory controller maintains coherence at the block
level. When it receives a get request from the AMU, it loads the coherent value of the target word either
from a processor cache or local memory, depending on the state of the block containing the word. It then
changes the state of the block to “shared” and adds the AMU to the list of sharers. Unlike traditional data
sharers, the AMU is allowed to modify the word without obtaining exclusive ownership first. The AMU
sends fine-grained “put” requests to the directory controller when it needs to write barrier variables back to
local memory. (Fine-grained “get/put” operations are planned for a future SGI DSM architecture; their
details are beyond the scope of this paper.) A fine-grained “put” updates a word of a block. When the
directory controller receives a
put request, it will send a word-update request to local memory and every node that has a copy of the block
containing the word to be updated.
To take advantage of get/put, the amo.inc instruction includes a “test” value that is compared against
the result of the increment. If the incremented value matches the test value, e.g., as would be the case in a
barrier operation if the test value is the total number of processors expected to reach the barrier, the AMU
sends a put request along with the new value to the directory controller. This update is like a signal to all
waiting processors. It does not have any extra overhead and happens at the earliest time possible.
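The home-node handling of amo.inc with its “test” value can be sketched as follows. The hook type and all names are ours, standing in for the fine-grained “put” machinery:

```c
/* Sketch of how the home-node AMU might process amo.inc: increment the word
 * and, when the result matches the "test" value, invoke a hook standing in
 * for the fine-grained "put" that updates every sharer's cached copy. */
typedef void (*put_hook)(unsigned *addr, unsigned value);

static unsigned last_update;                 /* recorded by the example hook */
static void record_put(unsigned *addr, unsigned value) {
    (void)addr;
    last_update = value;                     /* stand-in for updating sharers */
}

unsigned amu_handle_inc(unsigned *addr, unsigned test, put_hook send_put) {
    unsigned old = *addr;
    *addr = old + 1;
    if (*addr == test)                       /* last process reached the barrier */
        send_put(addr, *addr);               /* word-grained update wakes spinners */
    return old;                              /* old value goes back to requester */
}
```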
A potential way to optimize conventional barriers is to use a write-update protocol to eliminate invalida-
tion requests for barrier variables. However, issuing a block update after each increment generates too much
network traffic, offsetting the benefit of eliminating invalidation requests. The get/put mechanism eliminates
write-update’s drawbacks, because it issues word-grained updates (thereby eliminating false sharing) and it
only issues updates after the last process reaches the barrier rather than once each time any processor reaches
the barrier.
One possible problem with using get/put is it introduces temporal inconsistency between the barrier vari-
able values in the processor caches and the AMU cache. In essence, the get/put mechanism implements
release consistency for barrier variables, where the act of reaching the “test” value acts as a release point.
For barrier operations, release consistency is a completely acceptable memory model, but we must be extra
careful when applying AMOs to applications where release consistency might cause problems.
The first time a synchronization variable is accessed, the AMU obtains the latest copy from either a
processor cache or local DRAM. All subsequent AMOs to the same synchronization variable will find the
data in the AMU cache and thus require only two cycles to process. The synchronization variable will stay
in the AMU cache until it is evicted or the directory controller receives a request for exclusive ownership
for the block containing the variable.
AMO instructions are encoded in an unused portion of the MIPS-IV instruction set space, because our
target implementation of AMOs is an as-yet-unannounced future member of the MIPS processor family.
We are considering a wide range of AMO instructions, but for this study, we focus only on amo.inc. The
amo.inc instruction increments the target memory location by one and returns the original memory content
to a general register. It can be used to implement efficient barriers. Its mnemonic form is amo.inc rd,
rs1, rs2, where rd is the destination register to store the return value, rs1 holds a memory address,
and rs2 holds the “test” value. The test value is set as the expected value of the barrier variable after all
processes have reached the barrier point.
Before we describe how to use amo.inc to implement barriers, let us first investigate how barriers
are implemented. The following is a naive barrier implementation in which num procs is the number of
participating processes:
    atomic_add(&barrier_variable);
    spin_until(barrier_variable == num_procs);
This implementation is inefficient because it directly spins on the barrier variable. Since processes that
have reached the barrier repeatedly try to read a copy of the barrier variable into their local caches to test if
it has reached the expected value (num procs in this case), the next increment attempt by another process
will have to compete with these read requests, possibly resulting in a very long latency for the increment
operation. Although processes that have reached the barrier can be suspended to avoid interference with the
subsequent increment operations, the overhead of suspending and resuming processes is too high for an
efficient barrier.
A common optimization to the naive version is to spin on another variable, such as the following.
    int expected_value = barrier_counter + num_procs;
    if (atomic_add_one(&barrier_variable) + 1 == expected_value)
        barrier_counter = expected_value;
    else
        spin_until(barrier_counter == expected_value);
The code segment above assumes that atomic_add_one() returns the value of the barrier variable
before the increment. Instead of spinning on the barrier variable, this loop spins on another variable,
barrier_counter. For simplicity, we call barrier_counter the “spin” variable. The code above
differs slightly from the naive version because it has been extended so that it can be used for more than
one barrier operation after initialization, as long as barrier_counter and barrier_variable do
not overflow. Because data coherence is maintained at block level, for this code to work properly, pro-
grammers must make sure that the barrier variable and the spin variable do not reside in the same block.
Using the spin variable eliminates the interference between spin and increment operations. However, it
introduces some extra overhead — a write to the spin variable for each synchronization. A write to the
spin variable involves the home node sending an invalidation request to every processor before the write
occurs and each processor reloading a copy of the spin variable after the write has completed. Introduction
of the spin variable is undesirable because it is not an element in the algorithm but simply a programming
technique of trading programming complexity for performance. Nevertheless, the benefit of using the spin
variable often outweighs its overhead. Nikolopoulos and Papatheodorou [18] have demonstrated that
spinning on the spin variable is 25% faster than spinning on the barrier variable for a synchronization across
64 processors.
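The requirement that the barrier variable and the spin variable sit in separate coherence blocks can be met with alignment or padding; a sketch, assuming 128-byte blocks (the L2 line size used in the system we simulate later):

```c
#include <stddef.h>

/* One way to keep the barrier variable and the spin variable in different
 * coherence blocks: align each to its own 128-byte block. The block size is
 * an assumption of this sketch. */
struct barrier_state {
    _Alignas(128) volatile unsigned barrier_variable;  /* incremented atomically */
    _Alignas(128) volatile unsigned barrier_counter;   /* the "spin" variable */
};
```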
With AMOs, atomic increment operations are performed without invalidating shared copies in the pro-
cessor caches and the shared copies in the processor caches are automatically updated when all processes
reach the barrier. Consequently, AMO-based barriers can use the naive coding, i.e., directly spinning on the
barrier variable. The equivalent AMO-based version of barrier looks like the following.
    int expected_value = (barrier_variable / num_procs + 1) * num_procs;
    amo_inc(&barrier_variable, expected_value);
    spin_until(barrier_variable == expected_value);
In this version, amo_inc() is a wrapper function for the amo.inc instruction. Before calling it, the
program must first figure out the expected barrier value after all processes reach the barrier. For simplicity,
we require the initial value of the barrier variable to be a multiple of num_procs, such as zero. With
that guarantee, the expected value can be computed as ((barrier_variable / num_procs + 1) *
num_procs).
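As a worked example of this computation, with num_procs = 4 and a barrier variable of 8 (two completed barriers), the next expected value is (8 / 4 + 1) * 4 = 12:

```c
/* The expected-value computation from the AMO-based barrier, as a helper. */
int next_expected_value(int barrier_variable, int num_procs) {
    return (barrier_variable / num_procs + 1) * num_procs;
}
```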
Conventional processor-based atomic instructions often require significant effort from programmers to
write correct and efficient synchronization codes. For example, the Alpha Architecture Handbook [1] uses
three pages to explain each of the LL/SC instructions along with many restrictions. Processor-side imple-
mentations of atomic instructions, like fetchadd and cmpxchg on Itanium2TM [7], need to lock many system
resources to ensure atomicity. This leads to subtle situations that can cause livelocks, e.g., when stores to the
Write Coalescer are closely followed by semaphore instructions. By contrast, AMO instructions are inter-
ference free, do not lock any system resources to ensure atomicity, and eliminate the need for programmers
to be aware of how the atomic instructions are implemented. Since synchronization-related code is often the
hardest part of a parallel application to write and debug, simplifying the programming model is a significant
advantage of AMOs over other mechanisms, in addition to their performance superiority.
3 Evaluation
This section presents our experimental environment and results. We describe the simulation environment
in Section 3.1, delineate the different barrier synchronization functions that we have tried in Section 3.2, and
present the simulation results in Section 3.3.
Parameter            Value
Processor            four-way issue, 48-entry active list, 2 GHz
L1 I-cache           two-way, 32KB, 64-byte lines, 1-cycle latency
L1 D-cache           two-way, 32KB, 32-byte lines, 2-cycle latency
L2 cache             four-way, 2MB, 128-byte lines, 10-cycle latency
System bus           16-byte CPU to system, 8-byte system to CPU,
                     maximum 16 outstanding cache misses, 1 GHz
DRAM backend         16 16-bit-data DDR channels
Hub clock rate       500 MHz
DRAM latency         60 processor cycles
Network hop latency  100 processor cycles

Table 1. Configuration of the simulated systems.
We use the execution-driven simulator UVSIM in our performance study. UVSIM models a future-
generation SGI DSM supercomputer, including a directory-based SN2-MIPS-like coherence protocol [24]
that supports both write-invalidate and write-update. The write-update protocol is fine-grained, meaning
that an update request may replace as little as a single word in a cache line, as described in Section 2.2. Each
simulated node contains two MIPS R18000-like [17] microprocessors connected to sysTF busses [25]. Also
connected to the bus is SGI’s future-generation Hub [25], which contains the processor interface, memory
controller, directory controller, network interface, IO interface, and active memory unit.
UVSIM has a micro-kernel that supports most common system calls. It directly executes statically linked
64-bit MIPS-IV executables. UVSIM supports the OpenMP runtime environment. All benchmark programs
used in this paper are OpenMP-based parallel programs.
Table 1 lists the major parameters of the simulated systems. The L1 cache is virtually indexed and
physically tagged. The L2 cache is physically indexed and physically tagged. The DRAM backend has 16
20-bit channels connected to DDR DRAMs, which enables us to read an 80-bit burst every two cycles. Of
each 80-bit burst, 64 bits are data. The remaining 16 bits are a mix of ECC bits and partial directory state.
The simulated interconnect subsystem is based on SGI’s NUMALink-4. The interconnect is built using
16-port routers in a fat-tree structure, where each non-leaf router has eight children. We model a network
hop latency of 50 nsecs (100 cpu cycles). The minimum network packet is 32 bytes.
We have validated the core of our simulator by setting the configurable parameters to match those of an
SGI Origin 3000, running a large mix of benchmark programs on both a real Origin 3000 and the simulator,
Name      Description
Baseline  uses a pair of LL/SC instructions
ActMsg    uses active messages
Atomic    uses an atomic increment instruction
AMO       uses the amo.inc instruction
Table 2. Implementations of the barrier function.
and comparing performance statistics (e.g., run time, cache miss rates, etc.). The simulator-generated results
are all within 20% of the corresponding numbers generated by the real machine, most within 5%.
For the initial study of AMOs, we use amo.inc to optimize the barrier function from an OpenMP
library and compare AMOs with LL/SC instructions, processor-side atomic instructions, and active mes-
sages. Table 2 lists the different versions of the barrier function. We use the slowest and most widely
used implementation, the LL/SC version, as the baseline. In addition, we apply the software combining
tree based on the work by Yew et al. [28] to each implementation.
We have extended UVSIM to support active messages. In our model, both AMOs and active messages
share the same programming model; the only difference is the way each is handled by the receiving node.
We assume user-level interrupts are supported for active messages on both the recipient processor and the
initiating processor, so our results for active messages are somewhat optimistic.
All programs in our study are compiled using the MIPSpro Compiler 7.3 at optimization level “-O3”.
All results in this section come from complete simulations of the benchmark programs. Each program
invokes the synchronization function 11 times. The first call is used to synchronize the starting point of each
process. We then divide the wall time for the following ten invocations by ten and report this as the elapsed
time for a barrier operation.
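This methodology can be sketched as below; the function pointers and the stub clock stand in for the real OpenMP barrier and wall-clock routines:

```c
/* Measurement methodology: one warm-up barrier to synchronize the starting
 * points, then ten timed barriers, with the average reported. */
double time_barrier(void (*barrier)(void), double (*wall_time)(void)) {
    barrier();                              /* first call: align all processes */
    double start = wall_time();
    for (int i = 0; i < 10; i++)
        barrier();                          /* ten measured invocations */
    return (wall_time() - start) / 10.0;    /* elapsed time per barrier */
}

/* Stub barrier and clock for illustration: the clock advances 5.0 per query. */
static void   noop_barrier(void) {}
static double stub_now;
static double stub_clock(void) { return stub_now += 5.0; }
```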
Non-tree-based barriers
Table 3 presents the speedups of different barrier implementations over the baseline implementation. The
first column of this table contains the numbers of processors being synchronized. We vary the number of
processors from four (i.e., two nodes) to 256, the maximum number of processors allowed by the directory
       Speedup over Baseline
CPUs   Atomic  ActMsg  AMO
4      1.25    1.16    3.06
8      1.27    1.57    9.26
16     1.25    1.36    14.04
32     1.27    1.53    23.71
64     1.37    1.69    42.59
128    1.37    1.81    50.59
256    1.41    1.86    82.30
Table 3. Performance of different barriers.
[Figure 2 plots cycles per processor against the number of processors (4 to 256) for the Baseline, Atomic, ActMsg, and AMO barrier implementations.]

Figure 2. Cycles-per-processor of different barriers.
structure of the SN2-MIPS protocol [24]. The Atomic, ActMsg, and AMO barrier implementations all
achieve significant speedups over the baseline LL/SC version. Specifically, conventional atomic instructions
outperform LL/SC by a factor of 1.25 to 1.41. Active messages outperform LL/SC by a factor of 1.16 to 1.86.
However, AMO-based barrier performance dwarfs that of all other implementations, ranging from a speedup
of 3.06 for four processors to as high as 82.30 for 256 processors.
In the baseline implementation, each processor loads the barrier variable into its local cache before in-
crementing it using LL/SC instructions. Only one processor will succeed at one time; the LL/SCs on the
other processors will fail. After a successful update by a processor, the barrier variable will move to another
processor, and then to another processor, and so on. As the system grows, the average latency to move the
barrier variable between processors increases, as does the amount of contention. As a result, the synchro-
nization time in the base version increases superlinearly as the number of nodes increases. This effect can
be seen particularly clearly in Figure 2, which plots the per-processor barrier synchronization time for each
barrier implementation.
The conventional atomic instruction implementation of barriers eliminates the failed SC attempts in the
baseline version, and thus improves upon the baseline by more than 25% in all cases. However, its perfor-
mance gains are relatively small compared to the other implementations that we considered, because it still
requires a round trip over the network for every atomic operation, all of which must be performed serially.
For the active message version, an active message is sent for every increment operation. The overhead
of invoking the active message handler for each increment operation dwarfs the time required to run the
handler itself. Nonetheless, the benefit of eliminating remote memory accesses outweighs the high invoca-
tion overhead, which results in the active message version of barriers outperforming the baseline version for
all systems by 16% to 86%. In addition, the high overhead of each invocation results in a significant
number of messages being queued, which leads to message timeouts and retransmissions. These drawbacks
make active messages perform relatively poorly compared to AMOs.
The AMO implementation of barriers uses the special amo.inc instruction. After issuing an AMO, each
process spins on the barrier variable, waiting for it to reach the prescribed value. When the first amo.inc
operation arrives at the barrier variable’s home AMU, local DRAM or a processor cache is accessed to load
the barrier variable into the AMU’s cache. All subsequent AMOs find the data in the AMU cache and thus
require only two cycles to process. Thus, the bottleneck for AMO performance is the time to transfer the
request from the network backplane. After the barrier variable reaches a specified value, an update is sent
to every node that has a copy of the variable. Since a local process is spinning on the data, it is very likely
that every node will have a copy of the barrier variable in its local caches. Thus, the total cost of sending
update requests is close to the time required to send a single update request multiplied by the number of
participating processors.
Roughly speaking, the time to perform an AMO-based barrier equals (C + k * N), where C is a fixed
overhead, k is a small value related to the processing time of an amo.inc operation and an update
request, and N is the number of processors being synchronized. This expression implies that AMO-based
barriers scale well, which is clearly illustrated by the performance numbers in Figure 2. This figure shows
that the per-processor latency of AMO-based barriers is constant with respect to the total number of pro-
cessors. In fact, the per-processor latency drops off slightly as the number of processors increases for
AMO-based barriers because the fixed overhead is amortized by more processors. Contrast this with the
other implementations, where the larger the number of nodes in the system, the higher the per-processor
synchronization time.

(We do not assume that the network can physically multicast updates; AMO performance would be even
higher if the network supported such operations.)
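The scaling argument can be restated as a toy cost model: if a barrier costs roughly a fixed overhead C plus a small per-processor cost k for each of N processors, the per-processor cost C/N + k falls toward the constant k as N grows. The constants below are illustrative, not measured values:

```c
/* Per-processor cost of an AMO-based barrier under the model
 * total time = C + k*N, so per-processor time = C/N + k. */
double per_processor_cost(double C, double k, int N) {
    return (C + k * N) / N;
}
```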
       Speedup over Baseline
CPUs   Baseline+tree  Atomic+tree  ActMsg+tree  AMO+tree  AMO
8      1.04           1.11         1.19         1.32      9.26
16     2.06           2.17         2.08         2.10      14.04
32     2.42           2.97         3.57         3.82      23.71
64     4.80           5.46         6.74         9.06      42.59
128    6.28           6.17         8.16         11.35     50.59
256    10.03          12.08        13.51        14.90     82.30
Table 4. Performance of tree-based barriers.
[Figure 3 plots cycles per processor against the number of processors (8 to 256) for the Baseline+tree, Atomic+tree, ActMsg+tree, AMO+tree, and AMO barrier implementations.]

Figure 3. Cycles-per-processor of tree-based barriers.
Tree-based barriers
For all tree-based barriers, we use a two-level tree structure regardless of the number of processors. For each
configuration, we try all possible tree branching factors and use the one that delivers the best performance.
The initialization time of the tree structures is not included in the reported results. Since tree barrier algo-
rithms are designed for non-trivial system sizes, the smallest configuration that we consider in this study is
8 processors. Table 4 shows the speedups of tree-based barriers over the original baseline implementation.
Figure 3 shows the number of cycles per processor for the tree-based barriers.
Our simulation results clearly indicate that tree-based barriers perform much better and scale much better
than normal barriers, which concurs with the findings of Michael et al. [15]. Tree-based barriers create
parallelism and reduce hot-spot contention, so they are very effective at speeding up normal barriers.
On a 256-processor system, all tree-based barriers are at least ten times faster than the baseline barrier.
As seen in Figure 3, the number of cycles per processor for tree-based barriers decreases as the number
of processors increases, because the high overhead associated with using trees is amortized across more
processors and increments to different parts of the tree can proceed in parallel.
A tree-based barrier on a large system is basically a combination of several barriers on smaller systems.
Traditional barriers have similar performance on small-scale systems, so the differences among traditional
tree-based barriers are rather small.
One reason that tree-based barriers perform uniformly well is that we have tried different tree branch-
ing factors and present the best result for each given system. In a real system, we might not have the luxury
of choosing the optimal branching factors and tree re-construction could add non-negligible overhead. The
best branching factor for a given system is often not intuitive. Markatos et al. [12] have demonstrated that
in some multiprogramming environments, improper use of trees can drastically degrade the performance of
tree-based barriers to even below that of simple centralized barriers. Nonetheless, our simulation results do
demonstrate the performance potential of combining-tree-based barriers.
Even with all of the advantages of tree structures, tree-based barriers are still significantly slower than
AMO-based barriers. For instance, the best tree-based barrier (ActMsg + tree) is more than six times slower
than the AMO-based barrier on a 256-processor system. This is because tree-based barriers involve
higher overhead and generate more network traffic than AMO-based barriers.
Interestingly, the combination of AMOs and trees performs worse than AMOs alone for configurations of
this size. The cost of an AMO-based barrier includes a large fixed overhead and a very small number of cycles
per processor. Using tree structures with AMO-based barriers essentially introduces the fixed overhead more
than once, therefore resulting in a longer barrier synchronization time. That AMOs alone are better than the
combination of AMOs and trees for this configuration is another indication that AMOs do not require heroic
programming effort to achieve good performance. However, this relationship between normal AMOs and
tree-based AMOs might change if we move to systems with tens of thousands of processors. Determining
whether or not tree-based AMO barriers can provide extra benefits on such extremely large systems is part
of our future work.
4 Related Work
Physical-wire-based barriers have had great success in small- to medium-scale systems. Lundstrom [11]
proposes an ANDing network for barrier synchronization. Shang et al. [23] propose a distributed wired-
NOR network. Cray T3D [2] implements barriers with hierarchically constructed AND trees and fanout
circuits. Cray T3E [22] implements barrier trees using dedicated barrier/Eureka synchronization units, but
without using dedicated physical wires.
The fetch-and-add instructions in the NYU Ultracomputer [4, 9] are similar to AMOs in that both are
implemented in the memory controller. The Ultracomputer’s fetch-and-add uses a combining network that
tries to combine all loads and stores for the same memory location within the routers. It has the potential
drawback of delaying requests that do not use this feature. Combining is effective only when many fetch-
and-add instructions are issued over a very short time interval. In contrast, our approach performs the
combining at the barrier variable’s home memory controller, which eliminates the closeness-in-time problem
associated with combining in the network, and requires only two DRAM accesses per barrier regardless of the
number of processes.
The SGI Origin 2000 [10] and Cray T3E’s [22] atomic memory operations come closest to our work.
Like AMOs, atomic memory operations are implemented at the memory controller. Processors trigger
atomic memory operations by sending requests to special IO addresses on the home memory controller
of the atomic variable. The home MC interprets those special requests as commands to perform simple
atomic operations. With atomic memory operations, programs can either use uncached loads or, for better
performance, use spin variables during spinning. AMOs do not require spin variables, and thus are superior
to atomic memory operations in both performance and programming convenience. Comparing AMOs with
atomic memory operations is part of future work.
Michael et al. [16] have studied several implementations of the LL/SC instructions. In their studies,
computation power is placed at different levels of the memory hierarchy. They do so by carefully choosing
coherence policies for synchronization operations. For the computation-in-memory environment, a cen-
tralized reservation mechanism is used for granting the exclusive ownership to one of the processors that
attempt to do SC on a synchronization variable. This mechanism is similar to the LLbit approach in SGI
MIPS processors [17]. It needs to maintain a bit vector or a counter for each memory location in which the
synchronization variable may reside. Its performance gain is likely smaller than AMOs' for three reasons:
(i) every LL or SC must go to memory for the bit vector or counter; (ii) SC attempts fail frequently due to
heavy contention; and, (iii) updates that are sent to the sharers after each successful write generate too much
traffic and jam the interconnection network.
Adding intelligence to the memory controller is an approach taken by several researchers working to
overcome the “Memory Wall” problem [3, 14, 29]. Several researchers propose to incorporate processing
power on DRAM chips, so-called PIM (processor-in-memory) systems [6, 19, 20, 26]. Unlike PIMs, we do
not propose placing processors on the DRAM chips themselves. We believe that non-commodity DRAMs,
with their expected lower yields, lower density, and higher per-byte costs, are not well-suited for cost-
effective large-scale machines. Further, the performance of the PIM architectures is somewhat limited by
the fact that DRAM processes are slower than logic processes, and it is difficult to enforce data coherence
in PIM architectures.
5 Conclusions and Future Work
Efficient synchronization is a crucial operation for effective parallel programming of large multiprocessor systems. As network latency rapidly approaches thousands of processor cycles and multiprocessor systems become larger and larger, synchronization speed is quickly becoming a significant performance determinant.
In this paper, we propose a mechanism that uses special atomic active memory operations performed by the home node memory controller of a synchronization variable to achieve super-fast barrier synchronization. AMO-based barriers do not require extra spin variables or complicated tree structures to achieve good
performance. AMOs enable extremely efficient barrier synchronization at rather low hardware cost, with a
simple programming interface. Our initial simulation results indicate that AMOs are much more efficient
than the traditional LL/SC instructions, atomic instructions, and active messages, outperforming them by up
to a factor of 80.
AMU development is still in its infancy. Part of our future work is to compare AMOs to the other forms
of off-loaded atomic memory operations described in Section 4. We also need to test the sensitivity of
AMOs to network latency and investigate how to use AMOs to improve other synchronization primitives,
e.g., spinlocks and wait/signal. We are also considering how and whether to expand the AMU to support
more complicated operations like floating-point operations, vector-like operations that operate on a range of
data elements, and collective communication operations. For data with insufficient reuse to warrant moving
it across the network, using AMOs may significantly reduce network traffic, hide the long network latency,
and thus improve the overall performance of a variety of parallel operations.
References
[1] Compaq Computer Corporation. Alpha architecture handbook, version 4, Feb. 1998.
[2] Cray Research, Inc. Cray T3D systems architecture overview, 1993.
[3] W. Donovan, P. Sabella, I. Kabir, and M. Hsieh. Pixel processing in a memory controller. IEEE Graphics and Applications, 15(1):51–61, Jan. 1995.
[4] A. Gottlieb, R. Grishman, C. Kruskal, K. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer -- designing a MIMD shared-memory parallel machine. ACM TOPLAS, 5(2):164–189, Apr. 1983.
[5] R. Gupta and C. Hill. A scalable implementation of barrier synchronization using an adaptive combining tree. International Journal of Parallel Programming, 18(3):161–180, June 1989.
[6] M. Hall, P. Kogge, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, A. Srivastava, W. Athas, and V. Freeh. Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In SC'99, Nov. 1999.
[7] Intel Corp. Intel Itanium 2 processor reference manual. http://www.intel.com/design/itanium2/manuals/25111001.pdf.
[8] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice-Hall, 1992.
[9] C. Kruskal, L. Rudolph, and M. Snir. Efficient synchronization on multiprocessors with shared memory. ACM TOPLAS, 10(4):570–601, Oct. 1988.
[10] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA highly scalable server. In ISCA '97, pp. 241–251, June 1997.
[11] S. Lundstrom. Application considerations in the system design of highly concurrent multiprocessors. IEEE Trans. on Computers, C-36(11):1292–1390, Nov. 1987.
[12] E. Markatos, M. Crovella, P. Das, C. Dubnicki, and T. LeBlanc. The effect of multiprogramming on barrier synchronization. In Proc. of the 3rd IEEE Symposium on Parallel and Distributed Processing, pp. 662–669, Dec. 1991.
[13] C. May, E. Silha, R. Simpson, and H. Warren. The PowerPC Architecture: A Specification for a New Family of Processors, 2nd edition. Morgan Kaufmann, May 1994.
[14] S. McKee et al. Design and evaluation of dynamic access ordering hardware. In Proc. of the 1996 ICS, pp. 125–132, May 1996.
[15] M. M. Michael, A. K. Nanda, B. Lim, and M. L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. In Proc. of the 24th ISCA, pp. 219–228, June 1997.
[16] M. M. Michael and M. L. Scott. Implementation of atomic primitives on distributed shared memory multiprocessors. In Proc. of the First HPCA, pp. 222–231, Jan. 1995.
[17] MIPS Technologies, Inc. MIPS R18000 Microprocessor User's Manual, Version 2.0, Oct. 2002.
[18] D. S. Nikolopoulos and T. A. Papatheodorou. The architecture and operating system implications on the performance of synchronization on ccNUMA multiprocessors. International Journal of Parallel Programming, 29(3):249–282, June 2001.
[19] M. Oskin, F. Chong, and T. Sherwood. Active pages: A model of computation for intelligent memory. In Proc. of the 25th ISCA, pp. 192–203, June 1998.
[20] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A case for Intelligent RAM: IRAM. IEEE Micro, 17(2), Apr. 1997.
[21] M. Scott and J. Mellor-Crummey. Fast, contention-free combining tree barriers for shared memory multiprocessors. International Journal of Parallel Programming, 22(4), 1994.
[22] S. Scott. Synchronization and communication in the T3E multiprocessor. In Proc. of the 7th ASPLOS, Oct. 1996.
[23] S. Shang and K. Hwang. Distributed hardwired barrier synchronization for scalable multiprocessor clusters. IEEE Trans. on Parallel and Distributed Systems, 6(6):591–605, June 1995.
[24] Silicon Graphics, Inc. SN2-MIPS Communication Protocol Specification, Revision 0.12, Nov. 2001.
[25] Silicon Graphics, Inc. Orbit Functional Specification, Vol. 1, Revision 0.1, Apr. 2002.
[26] T. Sunaga, P. M. Kogge, et al. A processor in memory chip for massively parallel embedded applications. IEEE Journal of Solid-State Circuits, pp. 1556–1559, Oct. 1996.
[27] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser. Active messages: A mechanism for integrated communication and computation. In Proc. of the 19th ISCA, pp. 256–266, May 1992.
[28] P. Yew, N. Tzeng, and D. Lawrie. Distributing hot-spot addressing in large-scale multiprocessors. IEEE Trans. on Computers, C-36(4):388–395, Apr. 1987.
[29] L. Zhang, Z. Fang, M. Parker, B. Mathew, L. Schaelicke, J. Carter, W. Hsieh, and S. McKee. The Impulse memory controller. IEEE Trans. on Computers, 50(11):1117–1132, Nov. 2001.
Dynamic Code Repositioning for Java
Shinji Tanaka, Tetsuyasu Yamada, and Satoshi Shiraishi
Network Service Systems Laboratories, NTT
9-11 Midori-Cho, 3-Chome, Musashino-Shi, Tokyo, Japan
email: tanaka.shinji, yamada.tetsuyasu, [email protected]
Abstract
The sizes of recent Java-based server-side applications, like J2EE containers, have been increasing continuously. Past techniques for improving the performance of Java applications have targeted relatively small applications. Moreover, the frequency of invoking the methods of a target application is not normally distributed. As a result, these techniques cannot be applied efficiently to improve the performance of current large applications. We thus propose a dynamic code repositioning approach to improve the hit rates of instruction caches and TLBs. Profiles of method invocations are collected when the application runs under its heaviest processor load, and the code is repositioned based on these profiles. We also discuss a method splitting technique to significantly reduce the sizes of methods. Our evaluation of a prototype implementing these techniques indicated a 5% improvement in the throughput of the application.
Keywords: Java, code repositioning, dynamic compiler, cache and TLB performance, large-scale application.
1 Introduction
Recently, the use of Java has been spreading widely, especially among server-side systems providing web services. The code size of applications running on Java virtual machines (JavaVMs) on these kinds of servers is typically large. Current JavaVMs utilize dynamic compilers like HotSpot [8] and JIT compilers [11] [6]. These compilers convert bytecodes into native instructions and perform various optimizations, including inline method expansion. They do not, however, consider the total footprint of an application's working set of code. Unfortunately, large footprints reduce the hit rates of instruction caches and translation look-aside buffers (TLBs).
Current processors have very long pipelines to improve their performance, so a cache miss causes critical latency. These processors include mechanisms to improve cache hit rates, such as branch prediction and instruction prefetch, but the effectiveness of these mechanisms is limited.
Many researchers have focused on improving the hit rates of instruction caches and TLBs from the application's point of view. Pettis and Hansen [9] pioneered such code repositioning techniques. The goal of code repositioning is to minimize the footprint of an application's working set of code, thus improving the hit rates of the instruction caches and the TLBs. Their work focused on applications written in Fortran and Pascal, while Tamches and Miller [12] focused on kernel functions.
There are also numerous techniques for optimizing Java applications dynamically [1] [2] [10]. We thus applied a code repositioning technique to a JavaVM, with an application server as the target server-side application.
The novel contributions of this paper are the following.
Profiling Algorithm We describe how to gather the profile information used in the code repositioning technique. Because the life cycles of large applications, like those running on application servers, have several stages (i.e., startup, warm-up, and stable processing), the profile information should be gathered in the stage with the heaviest processor load.
Evaluation We evaluated our prototype, built on the IBM JIT, a widely used, high-performing JavaVM. Our evaluation indicated a 5% improvement in the throughput of a target application running on the application server.
The rest of this paper is organized as follows. Section 2 describes the dynamic code repositioning approach. Section 3 discusses the evaluation of our prototype, and Section 4 presents our conclusions.
2 Dynamic Code Repositioning
Current JavaVMs compile methods that are frequently invoked. A typical JavaVM has a counter for each method and increments it when the method is invoked. When a counter exceeds a threshold, the JavaVM compiles the corresponding method. Therefore, the order in which methods are compiled is not related to the order of invocation, and frequently executed code and rarely executed code are intermixed within individual methods. This approach results in low hit rates for instruction caches and TLBs.
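The counter-based compilation trigger just described can be sketched as follows. This is a minimal sketch: the class, the threshold value, and the compile queue are illustrative assumptions, not the JavaVM's actual internals.

```python
# Sketch of counter-triggered JIT compilation (hypothetical names; the
# real JavaVM data structures are not described at this level).
COMPILE_THRESHOLD = 1000  # invocations before a method is compiled

class Method:
    def __init__(self, name):
        self.name = name
        self.counter = 0
        self.compiled = False

def invoke(method, compile_queue):
    """Increment the per-method counter; enqueue the method for
    compilation once the threshold is exceeded. Note that the compile
    order follows the order in which counters cross the threshold, not
    the order of first invocation -- which is why hot and cold code end
    up intermixed in memory."""
    method.counter += 1
    if not method.compiled and method.counter > COMPILE_THRESHOLD:
        method.compiled = True
        compile_queue.append(method.name)
```

A method invoked 1001 times crosses the threshold exactly once and is queued for compilation at that point.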
Our proposal is to increase the hit rates of instruction caches and TLBs by repositioning the compiled code. The principles of code repositioning are as follows.

1. Decrease the distance between frequently called methods and their frequent caller methods.

2. Put rarely executed code in methods farther away.

With code repositioning, the number of instruction cache lines and memory pages occupied by frequently executed code can be reduced. In this section, we discuss how to implement these principles.
2.1 Online Profiling
To determine an optimized positioning for application code, statistical profiles of the frequencies of method invocation are useful, because frequently invoked caller and callee methods should be placed close to each other. This section describes how profiles of the application code are aggregated and how the application code is repositioned.
2.1.1 Life Cycle of an Application
Our target application, an application server, has a life cycle consisting of a start-up period for initializing the system, a warm-up period with few requests, and a stable processing period with more numerous requests. Therefore, the processor load for processing requests and the frequency of dynamically compiling application code vary with the stage of the system's life cycle.
Figure 1: Life cycle of the application server
Figure 1 shows the life cycle of the system. The x-axis represents time, and the y-axis represents the processor load and the frequency of dynamic compiling. The three stages in the life cycle are described in more detail as follows.
Start-up The first stage is start-up, in which the application loads its configuration and components, initializes its subsystems, and activates them. Because most of the application code executed in this stage applies only to initialization, there is little benefit in optimizing it.

Warm-up The second stage is warm-up, in which the application receives only a small number of requests. The throughput at the beginning of this stage is relatively low, because most of the code for processing requests has not yet been compiled. As time passes, the proportion of request-processing code that has been compiled increases, and the throughput rises with it. In contrast, the rate of dynamic compiling decreases with time.

Stable processing The last stage is stable processing, in which the application processes a large number of requests and its throughput is maximal. Most of the application code executed in this stage has already been compiled, and the dynamic compiler of the JavaVM is rarely executed.
To increase the maximum throughput, the optimization targets for dynamic code repositioning are the portions of code executed in the stable processing period, which has a high processor load and a low frequency of dynamic compilation.
2.1.2 Execution Profiling
To acquire a profile during the stable processing stage, a thread called the control thread periodically observes the frequency of compilation. The JIT increments a compilation counter when it compiles code dynamically. If the control thread finds that the counter is less than the threshold, it starts a profiling process. If not, it sleeps for a while.
The profiling process has three stages as follows.
Initialization The control thread inserts code that invokes a profiling method into the prologue code of all the compiled methods. Then the control thread sleeps for the duration of the profiling period, and the compiled code collects profile information on its own.

Profiling The profiling code is invoked with arguments consisting of the program counter of the caller method and the profile information of the callee method. The profiling code maintains a profiling table, each entry of which holds a pair consisting of the program counter of the caller method and the unique ID of the callee method, together with a counter for the number of invocations. When invoked, the profiling code searches the profiling table for the caller method's program counter and the callee method's ID. If they are not found, the profiling code registers the new pair in the profiling table. If they are found, it increments the pair's counter.

Finalization After the profiling period, the control thread awakes and restores the original prologue code, in order to stop profiling.
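The profiling-table update described above can be sketched as a dictionary keyed by (caller program counter, callee ID). The names and values here are illustrative, not the prototype's actual data structures.

```python
# Sketch of the profiling table: each entry pairs a caller program
# counter with a callee method ID and counts invocations. The PC values
# and method IDs below are invented for illustration.
def record_call(table, caller_pc, callee_id):
    """Register the (caller PC, callee ID) pair on first sight,
    otherwise increment its invocation counter."""
    key = (caller_pc, callee_id)
    table[key] = table.get(key, 0) + 1

table = {}
record_call(table, 0x4F10, 7)   # first call from this site
record_call(table, 0x4F10, 7)   # same site again: counter incremented
record_call(table, 0x5A20, 7)   # different call site, same callee
```

After these three calls the table holds two entries, with counts 2 and 1.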
The more samples are gathered, the more accurate the information. If the number of samples gathered during the profiling period is below the threshold, the system does not optimize the code but repeats the profiling process instead.
2.2 Code Repositioning
After the profiling process, the system decides which methods should be placed where, based on a "closest is best" strategy [9]. This section describes how the system determines the positions of the methods and repositions them.
2.2.1 Dynamic Call Graph
After the system gathers a sufficient number of samples, it creates an undirected, weighted dynamic call graph based on the profiling table, which stores the results of the profiling process (Figure 2). Each node of the graph represents a caller or callee method, and each edge represents a call between two methods. The weight of an edge represents the number of times the call was made. Recursive calls are ignored in this graph, because they cannot be optimized by repositioning code.
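Building the undirected weighted graph from the profiling table can be sketched as below. The `pc_to_method` lookup (mapping a caller's program counter back to its method) is an assumption of the sketch; the paper does not spell out how that resolution is done.

```python
# Sketch: aggregate (caller PC, callee ID) invocation counts into an
# undirected weighted call graph between methods, ignoring recursion.
# pc_to_method is an assumed helper mapping program counters to methods.
def build_call_graph(profile_table, pc_to_method):
    graph = {}  # frozenset({caller, callee}) -> summed weight
    for (caller_pc, callee), count in profile_table.items():
        caller = pc_to_method[caller_pc]
        if caller == callee:
            continue  # recursive call: repositioning cannot help
        edge = frozenset((caller, callee))
        graph[edge] = graph.get(edge, 0) + count
    return graph

# Invented profile: two call sites in D calling B, plus one recursive
# call from D to itself, which must be dropped.
profile = {(0x10, "B"): 700, (0x18, "B"): 300, (0x20, "D"): 5}
pcs = {0x10: "D", 0x18: "D", 0x20: "D"}
g = build_call_graph(profile, pcs)
```

The two D-to-B call sites collapse into a single undirected edge of weight 1000, and the recursive call is discarded.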
2.2.2 Sorting by Invocation Frequency
Figure 2: Dynamic call graph
Because our algorithm is based on a "closest is best" strategy [9], it first chooses the edge with the heaviest weight in the graph. If multiple edges have the same weight, one is arbitrarily chosen. The two methods corresponding to the two nodes connected by the chosen edge will be placed next to each other in the final positions of the code. Then, the two nodes in the graph are merged into one node. If these nodes both have edges to the same node, those edges are also merged into one edge whose weight is the sum of the weights of the merged edges. When the nodes are merged, a new node representing the merge is attached to a binary tree (Figure 3). The new node records the weight of the edge between the original nodes, and the original nodes become its child nodes. This process is repeated until the whole graph contains no edges.

Because every node in the dynamic call graph can be traced from the node representing the "main" method, which was invoked first by the JavaVM, the graph is connected. Therefore, the graph will contain only a single node at the final step.
Figure 3: Binary tree
Finally, the positions of the methods are determined by a depth-first search that traces the heavier child node first. The algorithm treats each leaf representing a method as lighter than any node representing a merge. If both child nodes represent methods, one is arbitrarily chosen.

Figures 2 and 3 show an example of a dynamic call graph and the corresponding binary tree, respectively. The edge connecting methods B and D has the heaviest weight in graph (1) of Figure 2, so methods B and D will be placed next to each other in the final positions of the methods. Then, a node representing this merge, with a weight of 1000, is attached to the binary tree, as shown in (a) of Figure 3, and the nodes representing methods B and D become its children. This process is repeated until the graph has no edges.
After inserting the merge nodes, the binary tree is traced by a depth-first search. The final positions of the methods become B-D-A-E-C in this example.
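The whole positioning pass, heaviest-edge merging into a binary tree followed by a heavier-child-first depth-first search, can be sketched end to end. The edge weights below are hypothetical (only the B-D weight of 1000 is stated in the text), chosen so the sketch reproduces the B-D-A-E-C ordering; ties are broken by dictionary order here rather than truly arbitrarily.

```python
def reposition(edges):
    """'Closest is best' positioning (after Pettis and Hansen):
    repeatedly merge the endpoints of the heaviest edge into one graph
    node while recording the merges in a binary tree, then emit the
    method order by a depth-first search visiting the heavier child
    first."""
    # Tree nodes: ('leaf', method) or ('merge', edge_weight, left, right).
    node, adj = {}, {}
    for (a, b), w in edges.items():
        for x in (a, b):
            node.setdefault(x, ('leaf', x))
            adj.setdefault(x, {})
        adj[a][b] = adj[a].get(b, 0) + w
        adj[b][a] = adj[b].get(a, 0) + w

    while any(adj[u] for u in adj):
        # Heaviest remaining edge (ties broken by dict order).
        u = max(adj, key=lambda n: max(adj[n].values(), default=-1))
        v = max(adj[u], key=adj[u].get)
        merged = ('merge', adj[u][v], node[u], node[v])
        neighbours = adj.pop(v)
        del node[v]
        adj[u].pop(v, None)
        for n, w in neighbours.items():
            if n == u:
                continue
            # Edges to a common neighbour are merged; weights summed.
            adj[n].pop(v)
            adj[u][n] = adj[u].get(n, 0) + w
            adj[n][u] = adj[u][n]
        node[u] = merged

    order = []
    def weight(t):          # leaves count as lighter than any merge node
        return t[1] if t[0] == 'merge' else -1
    def dfs(t):
        if t[0] == 'leaf':
            order.append(t[1])
        else:
            a, b = t[2], t[3]
            heavy, light = (a, b) if weight(a) >= weight(b) else (b, a)
            dfs(heavy)
            dfs(light)
    (root,) = node.values()  # a connected graph collapses to one node
    dfs(root)
    return order

# Hypothetical profile counts; only B-D = 1000 is given in the paper.
edges = {('B', 'D'): 1000, ('A', 'E'): 400, ('A', 'B'): 200,
         ('B', 'E'): 100, ('C', 'E'): 1}
```

With these weights the merges are (B,D), (A,E), ((B,D),(A,E)), and finally C, and the depth-first search yields the B-D-A-E-C layout discussed in the text.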
2.2.3 Repositioning
In a typical application, many methods are invoked only a few times. To reduce the cost of dynamic code repositioning, the edges for these infrequent calls are removed from the set of target methods: the control thread removes methods whose ratios of invocations to the total number of invocations are below a threshold after calculating the positions of the methods. Finally, the system recompiles the methods in the sorted order and inserts the native code.
2.3 Method splitting
Both frequently executed code and rarely executed code often exist within one method. It is thus more efficient to put frequently executed code close to a caller method and rarely executed code farther away. The system treats two types of code, exception code and inline compensating code, as rarely executed, and it puts them in a rare-code region, which is separate from the usual region in which frequently executed code is placed.
2.3.1 Exception Code Splitting
Exception mechanisms characterize Java. Exception code contains the control flows that create and throw exception objects, and the control flows that catch exception objects and return to the original sites after the exceptions are processed. These portions of code are executed relatively rarely. The system analyzes the control flows of methods and finds those flows that process exceptions and create exception handlers. The system treats the basic blocks of these control flows as rarely executed code.
2.3.2 Inline Compensating Code Splitting
The IBM JIT optimizes code by inline method expansion. While the application code is running, it looks for non-overridden methods among the compiled methods by analyzing the class hierarchy. It performs inline method expansion for the methods found and creates compensating code. The compensating code performs the original virtual method call and then returns to the original control flow.

If an inlined method is later overridden, the JIT replaces the expanded inline code with a call to the compensating code. With this algorithm, the JIT can optimize virtual method calls and still ensure correct behavior.

The system treats the compensating code as rarely executed.
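The splitting policy of this section amounts to partitioning a method's basic blocks into a hot region and a rare-code region. A minimal sketch, with block attributes invented for illustration (the prototype actually works on the JIT's internal representation):

```python
# Sketch of method splitting: exception-handling blocks and inline
# compensating code go to a separate rare-code region. The block
# attribute names here are assumptions, not the JIT's real metadata.
def split_method(blocks):
    """Partition basic blocks into (hot, rare) name lists."""
    hot, rare = [], []
    for b in blocks:
        if b.get('throws_exception') or b.get('exception_handler') \
                or b.get('inline_compensation'):
            rare.append(b['name'])
        else:
            hot.append(b['name'])
    return hot, rare

blocks = [
    {'name': 'entry'},
    {'name': 'fast_path'},
    {'name': 'throw_oob', 'throws_exception': True},
    {'name': 'devirt_fallback', 'inline_compensation': True},
]
hot, rare = split_method(blocks)
```

Only the entry and fast-path blocks stay in the hot region; the throw site and the devirtualization fallback are moved out of the frequently executed pages.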
3 Evaluation
This section gives the results of an experimental evaluation of a prototype implementation of the code repositioning technique. The prototype implementation was based on IBM JDK 1.3.0 [11]. The primary goals of our evaluation were to determine whether code repositioning could achieve a significant improvement in performance, and to analyze the system's behavior at the processor level.
3.1 Environment
All the experiments were performed on a 1 GHz Pentium III system [4] with 512 MB of RAM, running Red Hat Linux 7.2, kernel 2.4.9-13. The benchmarks we used were SPEC jbb2000 [5] and Call Control Simulator. SPEC jbb2000 is a benchmark designed to evaluate server performance by emulating a 3-tier middleware system. It outputs a score for comparing the performance of JavaVMs and underlying platforms. Call Control Simulator is our original benchmark, which emulates a JAIN Call Control [3] platform running on the JBoss application server [7]. It requires a client that generates an input sequence, and it outputs its throughput every second.

Table 1 summarizes the key characteristics of the Pentium III processor. The processor has two L1 caches, one for code and one for data; one L2 cache shared by both; and two TLBs, an I-TLB for code and a D-TLB for data. The technique discussed in this paper affects the behavior of the L1 code cache, the L2 cache, and the I-TLB.
3.2 Benchmark Results
To maximize the performance improvement, the profiles for code repositioning must be gathered during the most appropriate stage of the application's life cycle. To fulfill this requirement, we made the input sequence of Call Control Simulator continuous through the start-up and warm-up periods, arranged the thresholds for starting and stopping the profiling, and confirmed that the profiling finished by the end of the warm-up period.
In the case of generic benchmarks like SPEC jbb2000, it is not possible to arrange the input sequence, so we only set the thresholds. Therefore, its improvement may be relatively small.
The effectiveness of the code repositioning technique depends on the size of the application's code working set.

Table 1: Specification of the Pentium III
  L1 code cache   16 KB, 4-way, 32 B cache lines
  L2 cache        256 KB, 4-way, 32 B cache lines
  I-TLB           32 entries, 4-way

Table 2 lists the benchmark specifications: the size of the compiled code and the number of repositioned methods. The former is the size of the native code compiled from frequently invoked methods, and the latter is the number of methods that were invoked frequently while the profiles were gathered and that were repositioned by recompiling. The size of the compiled code differs from the size of the application's code working set, because the JIT compiles only methods that are invoked many times; however, the relation between these two sizes is similar across the benchmark applications.
Table 3 shows the results for Call Control Simulator. "Cycles instruction fetch stalled" means the number of cycles during which the processor stalls because it cannot fetch the next instruction. The throughput improved by 5.0%, the cycles instruction fetch stalled were reduced by 9.4%, and the I-TLB hit rate was significantly improved. On the other hand, the L1 cache hit rate and the L2 cache hit rate were not improved after repositioning the code.
Table 4 shows the results for SPEC jbb2000. The score improved by 1.2%, the cycles instruction fetch stalled were reduced by 9.8%, and the I-TLB hit rate improved by 10.2%. On the other hand, the L1 cache hit rate and the L2 cache hit rate were not improved, as in the results for Call Control Simulator.
According to these results, the major gains from the code repositioning technique derive from more efficient use of memory pages, not from more efficient use of the caches. The main reasons for the lack of effect on cache usage are as follows.

• The cache lines are small relative to the methods. Because methods are much bigger than a cache line, a single cache line normally contains code from only one method. Therefore, neither the code repositioning nor the method splitting can improve the efficiency of cache usage.

• The application's code working set is small relative to the size of each cache.
3.3 Repositioning Analysis
We observed the frequency of method invocation to verify that the code repositioning algorithm could put frequently used portions of code closer together. Figure 4 shows the distribution of method invocations for Call Control Simulator before code repositioning, while Figure 5 shows the distribution after code repositioning. In each figure, the x-axis represents the head address of the memory pages, and the y-axis represents the number of
Table 2: Application size
                           Size of compiled code (bytes)   Repositioned methods
  SPEC jbb2000             442,140                         79
  Call Control Simulator   1,940,012                       320
Table 3: Results for Call Control Simulator
                                     Default       Repositioned   Improvement
  Throughput                         65.5          68.8           5.0%
  Cycles instruction fetch stalled   2.12×10^10    1.92×10^10     9.4%
  L1 cache hit rate                  97.8%         97.8%          0.0%
  L2 cache hit rate                  70.9%         71.3%          0.4%
  I-TLB hit rate                     79.7%         85.4%          5.7%
Table 4: Results for SPEC jbb2000
                                     Default       Repositioned   Improvement
  Score                              10,113        10,238         1.2%
  Cycles instruction fetch stalled   1.14×10^10    1.03×10^10     9.8%
  L1 cache hit rate                  98.5%         98.6%          0.1%
  L2 cache hit rate                  90.2%         90.6%          0.4%
  I-TLB hit rate                     82.5%         92.6%          10.2%
Figure 4: Distribution of method invocation before coderepositioning
Figure 5: Distribution of method invocation after coderepositioning
! "$# &% ! ' # )(* + , ! ! ! - )! ".# /% ! '0(* + , ! ! !
1 32445% ! ' # )(* + , ! ! ! 1 326440% ! '0(* + , ! ! !
Figure 6: Coverage of application footprint
invocations of the methods in a page. The distribution before the code repositioning is spread over many pages, while the distribution after the code repositioning shows that frequently invoked methods were placed in a smaller number of pages, and most method invocations were confined to the methods in those pages.
Figure 6 shows the coverage rate for each application before and after code repositioning in terms of the number of pages. Because Call Control Simulator was the largest application, more pages were required to cover its footprint. Before code repositioning, 55% of invocations could be covered in 32 pages, which is the number of I-TLB entries available on a Pentium III. After code repositioning, on the other hand, 79% of invocations could be covered in 32 pages. The smallest application, jbb2000, covered 94% in 32 pages before code repositioning and 99% after code repositioning.
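The page-coverage figures above can be computed from per-page invocation counts as in the sketch below; the distribution used here is invented, not the measured data.

```python
# Sketch: fraction of all method invocations that fall in the N
# most-invoked pages (N = 32 matches the Pentium III I-TLB entries).
def coverage(page_counts, n_pages=32):
    top = sorted(page_counts.values(), reverse=True)[:n_pages]
    return sum(top) / sum(page_counts.values())

# Invented distribution: 40 pages, with invocations concentrated in
# the first eight pages, as repositioned code would produce.
counts = {page: 1000 if page < 8 else 10 for page in range(40)}
```

With this invented distribution, the 32 hottest pages capture 8240 of the 8320 total invocations, about 99% coverage.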
4 Conclusion
This paper has described and evaluated a code repositioning technique for a JavaVM that improves the performance of a large-scale application server. With this approach, the system profiles the application's behavior during the most appropriate stage of its life cycle. It then repositions the compiled code to minimize the memory footprint of the application's working set of instructions.

The results of our evaluation demonstrated performance improvements of 5.0% for Call Control Simulator and 1.2% for SPEC jbb2000. The improvement derives mainly from more efficient use of memory pages and the TLB.

Our future plan is to combine this code repositioning approach with the inline method expansion technique.
References
[1] M. Arnold, M. Hind, and B. G. Ryder, "Online feedback-directed optimization of Java," Proceedings of the 17th ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 111–129, ACM Press, 2002.
[2] R. G. Burger and R. K. Dybvig, "An infrastructure for profile-driven dynamic recompilation," International Conference on Computer Languages, pp. 240–251, 1998.
[3] JAIN Community, http://java.sun.com/products/jain/.
[4] Intel Corporation, "The Intel Xeon processor manuals." http://developer.intel.com/design/Xeon/manuals/.
[5] The Standard Performance Evaluation Corporation, "SPEC jbb2000 benchmarks." http://www.spec.org/osg/jbb2000/.
[6] K. Ishizaki, M. Kawahito, T. Yasue, H. Komatsu, and T. Nakatani, "A study of devirtualization techniques for a Java just-in-time compiler," Proceedings of the Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 294–310, ACM Press, 2000.
[7] JBoss, http://www.jboss.org/.
[8] M. Paleczny, C. Vick, and C. Click, "The Java HotSpot(TM) server compiler," USENIX Java(TM) Virtual Machine Research and Technology Symposium, April 2001.
[9] K. Pettis and R. C. Hansen, "Profile guided code positioning," Proceedings of the Conference on Programming Language Design and Implementation, pp. 16–27, ACM Press, 1990.
[10] E. Sirer, A. Gregory, and B. Bershad, "A practical approach for improving startup latency in Java applications," In Workshop on Compiler Support for Systems Software, INRIA Technical Report 0228, pp. 47–55, 1999.
[11] T. Suganuma, T. Yasue, M. Kawahito, H. Komatsu, and T. Nakatani, "A dynamic optimization framework for a Java just-in-time compiler," Proceedings of the OOPSLA '01 Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 180–195, ACM Press, 2001.
[12] A. Tamches and B. P. Miller, "Fine-grained dynamic instrumentation of commodity operating system kernels," Operating Systems Design and Implementation, pp. 117–130, 1999.
Session 2: Numerical Computations
Designing polylibraries to speed up linear algebra
computations∗†
Pedro Alberti, Pedro Alonso, Antonio Vidal
Departamento de Sistemas Informáticos y Computación.
Universidad Politécnica de Valencia.
Cno. Vera s/n, 46022 Valencia, Spain.
palberti, palonso, [email protected]
Javier Cuenca
Departamento de Ingeniería y Tecnología de Computadores.
Universidad de Murcia. Aptdo 4021. 30001 Murcia, Spain.

Domingo Giménez
Departamento de Informática y Sistemas.
Universidad de Murcia. Aptdo 4021. 30001 Murcia, Spain.
June 27, 2003
Abstract
In this paper we analyze the design of polylibraries, in which programs call routines from different libraries according to the characteristics of the problem and of the system used to solve it. An architecture for this type of library is proposed. Our aim is to develop a methodology which can be used in the design of parallel libraries. To evaluate the viability of the proposed method, the typical linear algebra library hierarchy has been considered. Experiments have been performed on different systems and with linear algebra routines from different levels of the hierarchy. The results confirm the design of polylibraries as a good technique for speeding up computations.

Keywords: linear algebra computation, polylibraries, automatic optimization, parallel linear algebra hierarchy, linear algebra modelling.
∗This work has been funded in part by Comisión Interministerial de Ciencia y Tecnología, project number TIC2000/1683-C03, and Fundación Séneca, project number PI-34/00788/FS/01.

†This work has been developed using the systems of the CEPBA (European Center for Parallelism of Barcelona), the CESCA (Centre de Supercomputació de Catalunya), and the Innovative Computing Laboratory at the University of Tennessee.
1 Introduction
The idea we put forward in this paper is to use a number of different basiclibraries to develop a polylibrary. To evaluate the method, the typical parallellinear algebra libraries hierarchy has been considered.
Linear algebra functions are widely used in the solution of scientific and engineering problems, which means that there is much interest in the development of highly efficient linear algebra libraries, and these libraries are used by non-experts in parallel programming. In recent years several groups have been working on the design of highly efficient parallel linear algebra libraries. These groups work in different ways: optimizing the library in the installation process for shared memory machines [1] or for message-passing systems [2, 3], analyzing the adaptation of the routines to the conditions of the system at a particular moment [4, 5, 7], or using polyalgorithms [8, 9]. Polylibraries can be used in combination with other techniques to speed up linear algebra computations, in such a way that a user will not need to decide the best library to use, and the polylibrary is responsible for obtaining a near optimum execution.
In Section 2, an architecture for a polylibrary is proposed, and the ensuing variations in the library hierarchy are considered. Section 3 shows some of the experimental results obtained. Finally, in Section 4, some conclusions are given.
2 The Design of Polylibraries
When solving linear algebra problems, basic linear algebra routines are called to solve different parts of the problem. Typically some version of BLAS [10] or LAPACK [11] is used in sequential or shared memory systems, and ScaLAPACK [12] can be used in message-passing systems. But there may be different versions of the libraries. It is possible to have libraries optimized for the machine (either provided by the vendor or freely available [13, 14]), or automatic tuning libraries like ATLAS [1] can be installed, or again alternative libraries can be used [15].
The hierarchy of linear algebra libraries is shown in figure 1. At the lowest level of the hierarchy is the basic linear algebra library (BLAS), which may be the reference BLAS, a BLAS optimized for the machine (provided by the vendor or freely available), or an automatic tuning library. To perform the communications in message passing programs a library optimized for the machine can be used, as could a free version of MPI [16] or PVM [17]. At higher levels of the hierarchy there are also different libraries with the same or similar functionality. To solve linear systems or eigenvalue problems in sequential processors, a reference or machine optimized LAPACK can be used. To solve the problem in a message-passing platform, different versions of ScaLAPACK or PLAPACK can be used.
By using libraries from the lower or the middle levels of the hierarchy, we can develop libraries to solve problems in different fields (libraries to solve ODE problems, inverse eigenvalue problems, least square Toeplitz problems, etc.).
For example, it is possible to solve an eigenvalue problem using ScaLAPACK,
Figure 1: Hierarchy of linear algebra libraries. (Highest level: ODE library, eigenvalue problem library, least square Toeplitz library. Middle level: reference LAPACK, machine LAPACK, ESSL; reference ScaLAPACK, machine ScaLAPACK, PLAPACK; PBLAS; BLACS. Lowest level: reference BLAS, machine BLAS, ATLAS; machine MPI, MPICH, LAM, PVM.)
LAPACK, ATLAS and machine MPI; alternatively, it could be solved using PLAPACK, machine BLAS and MPICH. Likewise, the same problem, with different sizes, may require distinct libraries.
2.1 The Advantage of Polylibraries
Traditionally the user decides to use a particular set of libraries in the solution of all the problems. There are some considerations which make it advisable to use different libraries for different problems. Automatic selection is therefore interesting for a number of reasons:
• A library optimized for the system might not be available. The use of an automatic tuning library would be satisfactory, although with time a better library may become available. Thus, it is necessary to have a common method to evaluate libraries or routines in the libraries, so that when a new library is installed, the preferred library can be changed.
• When the characteristics of the system change, the preferred library can also change. This makes it necessary to reinstall the libraries.
• Which library is the best may not be clear, since library preference will vary according to the routines and the systems in question. This makes it necessary to develop the concept of polylibrary. The information generated in the installation of each library is used to construct the polylibrary.
• Even for different problem sizes or different data access schemes the preferred library can change. This makes it necessary to design polylibraries in such a way that a routine in the polylibrary calls one library or another not only according to the routine, but also according to other factors such as the problem size or the data access scheme. Hence, the routines in the polylibrary act as interfaces between the program and the routines of the different libraries.
• We may be working in a parallel system with the file system shared by processors of different types, which may or may not share the libraries. If they share a basic library, it may be optimized for one type of processor and not for the others. In any case, the resulting library will be different for the different types of processors. In parallel systems the polylibrary will be different in processors that do not share the basic libraries, while in processors sharing the basic libraries the polylibrary could be the same, but with the interface routines including the possibility of calling one library or another according to the processor type.
In this paper, we propose the architecture of a polylibrary, and show with experiments with some linear algebra routines how the use of polylibraries helps in the decision of the libraries to be used to reduce the execution time when solving a problem.
2.2 Architecture of a Polylibrary
A possible architecture for a polylibrary is shown in figure 2. For each of the basic libraries an installation file (LIF) will be generated, containing information about the performance of the different routines in the library for different variable factors. The format of the different LIFs must be the same, and the generation of the LIFs must be carried out in the polylibrary installation or updating process. This generation can be made by using some model of the routines to predict the execution time or by performing some executions [18, 2]. When the generation is made by performing some executions, the process must be guided in order to avoid an overlong installation [19]. For each routine in the polylibrary, an interface routine is developed.
The polylibrary will use the information generated in the lower level libraries installation (LIFs) to generate the code of the interface routine. To do this, the format of the different LIFs should be the same, and what we propose is to develop each library together with an installation engine which generates the LIF in the required format. This format must be decided by the linear algebra community in order to facilitate the use of the information to design polylibraries or libraries of higher level in the libraries hierarchy. The information in the LIF could be the cost of an arithmetic operation using some routines. In some cases
Figure 2: Architecture of a polylibrary. (The installation process of each library generates its LIF; the polylibrary consists of interface routines which, depending on the problem size n and the data storage, call the corresponding routine from one library or another.)
this could be obtained for different problem sizes, because the use of different costs for different sizes can produce a better prediction and consequently a reduction in the execution time [20].
For some routines more than one parameter can be used, as for example in matrix-matrix multiplication, where it could be interesting to store the cost with square or rectangular matrices. Also the different access schemes can produce different costs; for example, the BLAS routine drot can have different costs when the data in the vectors are stored in adjacent positions of memory than when they are stored in non-adjacent positions. The format of the LIF could be as follows:
drot                  dgemm
  ld 1, ld n            column, row
  100   20              100   20
  200   40              200   40
  400   80
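A LIF of this kind can be held in memory as a small cost table indexed by routine, access scheme and problem size, with the interface routine looking up the entry for the nearest installed size. The following Python sketch is our own illustration: the dictionary layout, the `nearest_cost` helper and the cost units are assumptions, not a format fixed by the paper.

```python
# Hypothetical in-memory form of a LIF: measured costs (arbitrary units)
# indexed by (routine, access scheme), then by problem size.
LIF = {
    ("drot", "ld 1"): {100: 20, 200: 40, 400: 80},
    ("dgemm", "column"): {100: 20, 200: 40},
}

def nearest_cost(lif, routine, scheme, n):
    """Return the cost recorded for the installed size closest to n."""
    table = lif[(routine, scheme)]
    size = min(table, key=lambda s: abs(s - n))
    return table[size]

print(nearest_cost(LIF, "drot", "ld 1", 380))
```

An interface routine would compare such costs across the LIFs of the installed libraries and call the routine of the cheapest one.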
The development of installation engines must be included not only in the basic libraries (BLAS, ATLAS, MPI, PVM, ...), but also in libraries of higher level (LAPACK, PLAPACK, ScaLAPACK, ...). This allows the development of higher level libraries, which could include an installation engine to generate their own LIFs.
2.3 Combining Polylibraries with other Optimisation Techniques
There are different techniques for obtaining optimized libraries, and better libraries can be obtained by combining different techniques.
In polyalgorithms we have different algorithms to solve the same problem. The algorithms can be implemented in different libraries, and the use of different algorithms does not differ from the use of different libraries. If one of the basic libraries used is an automatically tuned library, a large amount of time may be necessary to install the library in the system [1, 2], which means a large amount of time is necessary to update the polylibrary when the conditions of the system change. The same is true when installing or reinstalling the polylibrary, and in a parallel system multiple installations will be necessary (possibly in parallel). Thus the installation process must be guided by the system manager [3], who decides when it is convenient to reinstall the polylibrary or include a new library, or a new type of processor, and which factors and range of values to consider.
In some cases the values of some algorithmic parameters are obtained at execution time [20]. These values can be the block size in algorithms by blocks, the number of processors, or some parameters determining the logical topology in a parallel system. This means different values of these parameters must be considered in the polylibrary installation. The routine interface in the polylibrary must contemplate this possibility, and the same routine can be called with different values of the parameters for different problem sizes.
It could be interesting to have a system which attends to requests from the user and provides a suitable set of resources according to the present state of the system. Some research is being done in this direction [5, 6]. To adapt the method we propose to systems where the load varies dynamically, some heuristic can be used [7]. The values obtained at installation time can be adjusted by using some measures of the system load obtained at running time.
3 Experimental Results
Experiments have been carried out with routines of different levels in the hierarchy:
• In the lowest level of the hierarchy, the routines are the typical BLAS ones. These routines must be installed by performing selected computations, and the information is stored in the LIF of the library. The matrix-matrix multiplication has been used because it is a representative operation of the lowest level libraries, it is used in algorithms by blocks, and it is a good example to show the methodology. In addition, versions by blocks and the Strassen method [21] have been considered, taking into account the combination of the design of polylibraries and polyalgorithms.
• Matrix factorizations by blocks [22] have been analyzed (LU and QR) because they are basic operations at a second level of libraries, and use routines of a lower level, as for example the matrix-matrix multiplication.
• As examples of the highest level of libraries, a lift-and-project algorithm to solve the inverse additive eigenvalue problem [23] and an algorithm to solve the Toeplitz least square problem [24, 25] have been considered. In the former, the cost is of order O(n^4) and it is easy to make a good choice of the parameters to obtain executions close to the optimum. The second algorithm has order O(n^2), and it is more difficult to make a good selection of the parameters. In both cases they use routines of lower levels, and those routines are used in a black-box way.
The experiments have been carried out over different systems because the goal is to obtain a methodology valid for a wide range of systems. The platforms used have been a shared memory system (Origin 2000 with 32 processors) and some networks of processors.
Some of the experimental results obtained are reported in this section. More experimental results can be found in [26].
3.1 Matrix-matrix Multiplication
The matrix-matrix multiplication is an operation from the lowest level of the hierarchy, but higher level implementations have been considered. For polyalgorithms we use a direct multiplication, a multiplication by blocks, and a Strassen multiplication. The use of these algorithms allows us to combine the design of polylibraries with the automatic choice of algorithmic parameters: in the algorithm by blocks the block size is an algorithmic parameter, and in the Strassen multiplication the algorithmic parameter is the size of the matrix at which to stop the recurrence.
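As an illustration of the polyalgorithmic ingredient, the Strassen multiplication with its stopping parameter can be sketched in Python with NumPy. This is a didactic version, assuming square matrices whose size is the base size times a power of two; in the setting of the paper the direct multiplication below the base size would be a call to the selected BLAS.

```python
import numpy as np

def strassen(A, B, base):
    """Strassen multiplication; at or below size `base` fall back to a
    direct multiplication (here NumPy's `@`, standing in for BLAS)."""
    n = A.shape[0]
    if n <= base:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, base)
    M2 = strassen(A21 + A22, B11, base)
    M3 = strassen(A11, B12 - B22, base)
    M4 = strassen(A22, B21 - B11, base)
    M5 = strassen(A11 + A12, B22, base)
    M6 = strassen(A21 - A11, B11 + B12, base)
    M7 = strassen(A12 - A22, B21 + B22, base)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7   # C11
    C[:h, h:] = M3 + M5             # C12
    C[h:, :h] = M2 + M4             # C21
    C[h:, h:] = M1 - M2 + M3 + M6   # C22
    return C
```

The `base` argument is exactly the algorithmic parameter discussed above: a small base size does more Strassen levels with fewer arithmetic operations but worse locality, so its best value depends on the machine and library.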
For routines in the lowest level of the hierarchy only the sequential case is analyzed, because these routines would be used by parallel routines which call them in each processor.
In a network of six SUNs with two different types of processors (five SUN Ultra 1 and one SUN Ultra 5), the interface of the routine would have one part for each type of processor. The libraries used in the experiments have been: a freely available reference BLAS (BLASref), the scientific library of SUN (BLASmac), and three versions of ATLAS: a version of ATLAS installed in the SUN Ultra 1 (ATLAS1), and another two prebuilt versions of ATLAS 3.2.0 [30] for Ultra 2 (ATLAS2) and Ultra 5 (ATLAS5). The same libraries are used by the SUN Ultra 1 and the SUN Ultra 5 because they share the file system.
The values in the LIF can be obtained by performing executions for the different algorithms, problem sizes and values of the parameters provided by the library manager. This may suppose a large amount of time, and heuristics can be used to reduce that time. For example, in the algorithm by blocks, experiments can be performed for different blocks for the smallest problem size. In the Strassen algorithm, for the smallest problem size, the problem is solved with base size half the problem size, then a quarter of the problem size, and so on until the execution time increases. The optimum base size is taken as the initial value for the local search for the next problem size.
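The installation heuristic just described can be sketched as follows. Here `time_of(n, base)` stands for an actual timing experiment, and the structure (halving the base size while the measured time decreases, restarting the search from the previous optimum) is our simplified reading of the text, not code from the paper.

```python
def best_base_size(n, time_of, start=None):
    """Halve the base size starting from `start` (default n // 2) while
    the measured time keeps decreasing; return the last improving size."""
    base = start if start is not None else n // 2
    best, best_t = base, time_of(n, base)
    while base > 1:
        base //= 2
        t = time_of(n, base)
        if t >= best_t:
            break
        best, best_t = base, t
    return best

def install(sizes, time_of):
    """Build the LIF entries: full descent for the smallest size, then
    start each search from the previous optimum."""
    lif, prev = {}, None
    for n in sorted(sizes):
        prev = best_base_size(n, time_of, start=prev)
        lif[n] = prev
    return lif
```

With a real timing function the loop stops as soon as halving the base size no longer pays off, so only a few executions per problem size are needed.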
If the installation is made with matrix sizes 400, 800 and 1200, the interface routine could be:

if processor is SUN Ultra 5
    if problem-size < 600
        solve using ATLAS5 and Strassen method with base size half of problem size
    else if problem-size < 1000
        solve using ATLAS5 and block method with block size 400
    else
        solve using ATLAS5 and Strassen method with base size half of problem size
    endif
else if processor is SUN Ultra 1
    if problem-size < 600
        solve using ATLAS5 and direct method
    else if problem-size < 1000
        solve using ATLAS5 and Strassen method with base size half of problem size
    else
        solve using ATLAS5 and direct method
    endif
endif
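The same decision logic can be written as a dispatch table generated at installation time. This Python sketch mirrors the thresholds of the example; the `select` helper and the rule layout are our own illustration.

```python
# Decision rules generated at installation time, one list per processor
# type: (size limit, library, method, parameter), tried in order.
RULES = {
    "SUN Ultra 5": [
        (600,          "ATLAS5", "strassen", "n/2"),
        (1000,         "ATLAS5", "blocks",   400),
        (float("inf"), "ATLAS5", "strassen", "n/2"),
    ],
    "SUN Ultra 1": [
        (600,          "ATLAS5", "direct",   None),
        (1000,         "ATLAS5", "strassen", "n/2"),
        (float("inf"), "ATLAS5", "direct",   None),
    ],
}

def select(processor, n):
    """Return the (library, method, parameter) for this processor and size."""
    for limit, library, method, parameter in RULES[processor]:
        if n < limit:
            return library, method, parameter
    raise ValueError("no rule matched")
```

Keeping the rules as data rather than code makes it easy for the installation process to regenerate them when a library is added or the system changes.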
The lowest execution time (low.) is compared with the execution time with the library, method and parameter provided by the model (mod.), and with that obtained using the direct method with ATLAS5. The comparison is shown in table 1. The execution time using the model is close to the lowest, and the same happens when the direct method and ATLAS5 are used. The conclusion could be that it is better to determine the best library (ATLAS5) and method (direct). Possibly this is a good option in some cases, but we observe that the optimum library is not always apparent. The preferred library should be the scientific library of SUN, but it is clearly surpassed by all the versions of ATLAS. In the SUN Ultra 1, one would expect the preferred ATLAS to be that installed in the system by the manager (ATLAS1), but in fact the best is ATLAS5.
The selection of the library, method and parameters is satisfactory in all the systems considered, and this produces small differences between the lowest execution time and that obtained with the values provided by the polylibrary.
The use of different algorithms and libraries is not a great improvement with respect to the use of the best library, but the proposed method can also be used just to obtain or update which is the best library among those available.
The method can also be used to decide which algorithm to choose from a set of available algorithms, because each algorithm can be considered as the routine of a different library.
The selection of the parameters which produce the lowest execution times can be combined with the selection of the library or algorithm.
Table 1: Comparison of the lowest execution time, the execution time obtained using the model, and the execution time using the direct method and ATLAS5. In SUN Ultra 5.

size                  200       600       1000      1400      1600
low.   time           0.038     1.063     4.678     12.532    20.031
       library        ATLAS5    ATLAS5    ATLAS5    ATLAS2    ATLAS2
       method         direct    direct    Strassen  Strassen  blocks
       parameter                          2         2         400
mod.   time           0.041     1.112     4.678     12.575    26.569
       library        ATLAS5    ATLAS5    ATLAS5    ATLAS5    ATLAS5
       method         Strassen  blocks    Strassen  Strassen  Strassen
       parameter      2         400       2         2         2
ATLAS5 direct, time   0.038     1.063     4.833     13.501    31.023
3.2 Matrix factorizations
Automatic tuning medium level routines can be developed by obtaining a theoretical model of the routine where the system parameters are identified, and the values of the system parameters are obtained from those stored in the LIFs of lower level libraries. After the substitution of the values some decisions can be taken, e.g. the best library and the optimum values of some algorithmic parameters. We show how the method works with the LU and QR factorizations.
One of the most difficult parts of developing a hierarchy like the one proposed is to obtain a common cost model valid for a wide range of systems and algorithms. This is not the goal of this paper, and there are some complementary works on this issue [27, 3, 28, 29]. Some of the ideas to consider when obtaining a simplified but useful model are: the parameters in the theoretical model are taken not as constant values, but as functions of the problem size and some parameters which depend on the algorithm (the block size, the number of processors, the topology of the communication network, etc.); different parameters are used for different operations at the same computational level; and the values of the parameters also depend on the way the data are stored and accessed.
With these considerations, the cost of the parallel block LU factorization can be modelled by the formulae:

\frac{2}{3} k_{3,dgemm} \frac{n^3}{p} + \frac{r+c}{p} b \, k_{3,dtrsm} n^2 + \frac{1}{3} b^2 k_{2,dgetf2} \, n \qquad (1)

for the arithmetic cost, and:

t_s \frac{2 n \max(r,c)}{b} + t_w \frac{2 n^2 \max(r,c)}{p} \qquad (2)

for the communication cost. And for the QR factorization the costs are modelled by:

\frac{4}{3} k_{3,dgemm} \frac{n^3}{p} + \frac{1}{4} \frac{c}{b} k_{3,dtrmm} n^2 + \frac{1}{r} b \, k_{2,dgeqr2} n^2 + \frac{1}{2} \frac{r}{b} k_{2,dlarft} n^2 \qquad (3)

and:

t_s \frac{n}{b} \left( (2+3b) \log r + 2 \log c \right) + t_w \left( \frac{n^2}{2} \left( \frac{2(r-1)}{r^2} + \frac{\log r}{c} + \frac{\log r}{r} \right) + n b \log p \right) \qquad (4)
The matrix is distributed in a logical two-dimensional mesh (in ScaLAPACK style [12]) of p = r × c processors, with block size b.
The system parameters are k3 (operations of level 3 BLAS), k2 (operations of level 2 BLAS), ts (start-up time) and tw (word-sending time); and the algorithmic parameters are b (block size) and r and c (number of rows and columns in the mesh of processors). The cost of the arithmetic operations is considered different for different routines (k3,dgemm, k3,dtrsm, k2,dgetf2, k3,dtrmm, k2,dgeqr2 and k2,dlarft). The two factorizations have in common the parameter k3,dgemm. Hence, they will use the information (LIF) generated for k3,dgemm when the matrix-matrix multiplication was installed in the system. The library (l) can also be considered as an algorithmic parameter. Thus we have k3,dgemm(n, b, r, c, l), and the same with the other system parameters. In the installation of a routine like dgemm, the dependence of the system parameters with respect to the algorithmic parameters is analyzed, and possibly the number of dependencies will be reduced: the dependence may be k3,dgemm(b, l).
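Parameter selection then amounts to evaluating the model for every candidate library, mesh and block size, with the k values read from the LIFs, and keeping the minimum. The sketch below does this for the LU model, formulae (1) plus (2); the k values in the test are made-up placeholders, not measured data.

```python
def lu_time(n, b, r, c, k, ts, tw):
    """Modelled LU time: arithmetic cost (formula (1)) plus
    communication cost (formula (2))."""
    p = r * c
    arith = (2 / 3) * k["dgemm"] * n ** 3 / p \
        + (r + c) / p * b * k["dtrsm"] * n ** 2 \
        + (1 / 3) * b ** 2 * k["dgetf2"] * n
    comm = ts * 2 * n * max(r, c) / b + tw * 2 * n ** 2 * max(r, c) / p
    return arith + comm

def choose(n, libraries, meshes, blocks, ts, tw):
    """Return the (library, b, r, c) minimizing the modelled time.
    `libraries` maps a library name to its k values from the LIFs."""
    candidates = ((lib, b, r, c)
                  for lib in libraries
                  for r, c in meshes
                  for b in blocks)
    return min(candidates,
               key=lambda t: lu_time(n, t[1], t[2], t[3],
                                     libraries[t[0]], ts, tw))
```

Because the candidate sets are small (a few libraries, meshes and block sizes), the exhaustive minimization is cheap compared with a single execution of the routine.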
Experiments have been carried out in a network of Pentium III with Myrinet and Fast-Ethernet. Five libraries have been used: ATLAS, reference BLAS, a BLAS for Pentium II Xeon (BLASII), and two versions of a BLAS for Pentium III (BLASIII). The results obtained with reference BLAS are always worse than those with ATLAS, and the second version of BLASIII always works faster than the first version.
In table 2 the experimental results with the LU factorization are summarized. The theoretical (the.) and lowest (low.) execution times, the execution time obtained with the parameters provided by the model (mod.), and the associated block size are shown for different libraries and problem sizes. The lowest optimum and modelled execution times for each matrix size are highlighted. The model provides good values for the parameters. The best results are obtained with BLASIII and BLASII, as predicted by the model. Also the optimum block size is well predicted. The difference between the libraries is less important in the parallel case than in the sequential case due to the cost of communications, which is common to all the libraries.
In table 3 results obtained with the QR factorization are shown. The execution time with the parameters provided by the model (mod.) and the lowest execution time (low.) are shown together with their parameters. The times with the model are always close to the lowest times, because the model makes a good selection of the parameters: number of processors, dimensions of the mesh, block size and library.
Table 2: Execution time (in seconds) of the parallel LU routine, with different basic libraries. In a network of 4 Pentium III with Myrinet.

ATLAS      the.          low.          mod.
size       time   blo.   time   blo.   time   blo.
512        0.13   32     0.12   32     0.12   32
1024       0.79   32     0.74   32     0.74   32
1536       2.36   32     2.21   64     2.27   32

BLASII     the.          low.          mod.
size       time   blo.   time   blo.   time   blo.
512        0.13   32     0.11   32     0.11   32
1024       0.77   32     0.71   32     0.71   32
1536       2.30   32     2.13   32     2.13   32

BLASIII    the.          low.          mod.
size       time   blo.   time   blo.   time   blo.
512        0.13   32     0.11   32     0.11   32
1024       0.77   32     0.70   32     0.70   32
1536       2.30   32     2.13   32     2.13   32
Table 3: Comparison of the execution time of the QR factorization. In a network of 8 Pentium III with Fast-Ethernet.

size              128     256      384      1024    2048    3072
mod.  time        0.025   0.087    0.178    1.92    9.90    28.00
      processors  1 × 2   1 × 8    1 × 8    1 × 8   1 × 8   1 × 8
      block size  8/16    8        8        16      16      16
      library     BLASII  BLASII   BLASII   BLASII  BLASII  BLASII
low.  time        0.025   0.086    0.176    1.92    9.40    25.11
      processors  1 × 2   1 × 8    1 × 8    1 × 8   1 × 8   1 × 8
      block size  16      8        8        16      16      16
      library     BLASII  BLASIII  BLASIII  BLASII  BLASII  BLASII
3.3 Experiments with a Toeplitz Least Square program
The problem to solve is the Least Square (LS) problem: min_x ||Tx − b||_2, where T ∈ R^{m×n} is Toeplitz (T_ij = t_{i−j}, i = 0, 1, ..., m − 1, j = 0, 1, ..., n − 1) and b ∈ R^m.
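For reference, the dense problem can be stated in a few lines of Python with NumPy. This builds T explicitly and solves with a general least squares routine; it ignores the fast structure-exploiting algorithms of [24, 25] and is only meant to fix the notation.

```python
import numpy as np

def build_toeplitz(t, m, n):
    """T[i, j] = t[i - j]; `t` maps offsets -(n-1)..(m-1) to entries."""
    return np.array([[t[i - j] for j in range(n)] for i in range(m)])

def toeplitz_ls(t, m, n, b):
    """Dense reference solution of min_x ||T x - b||_2."""
    T = build_toeplitz(t, m, n)
    x, *_ = np.linalg.lstsq(T, b, rcond=None)
    return x
```

A fast Toeplitz solver only needs the m + n − 1 generating values t, while this dense version materializes the full m × n matrix.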
Experiments have been carried out in two networks of Pentiums with Myrinet. In a Pentium II network with Myrinet, the execution time in the sequential case (seq.), the lowest experimental time (low.) and the execution time with the number of processors provided by the model (mod.) are shown in table 4, along with the number of processors used and the speed-up. We can see the model does not provide the best number of processors, and the difference between the optimum execution time and the execution time with the number of processors provided by the model is on average about 58%. For algorithms with a small cost it could be preferable to perform some selected executions at installation time to decide the values of the parameters, and to use the model only for bigger problem sizes [18].
Table 4: Comparison of the optimum execution time and the execution time with the parameters provided by the model. Routine LS Toeplitz in a network of Pentium II with Myrinet.

m × n         seq.    low.                     mod.
              time    time   proc.  speed-up   time   proc.  speed-up
1000 × 500    0.868   0.341  16     2.55       0.868  1      1.00
1000 × 1000   2.144   0.988  8      2.17       1.715  2      1.25
2000 × 200    0.600   0.178  16     3.37       0.383  2      1.57
2000 × 1000   3.630   1.068  16     3.40       1.580  4      2.30
2000 × 2000   9.225   2.922  16     3.16       4.151  4      2.22
The library preferred in the sequential case could also be considered as that preferred for the parallel routine. In that way ATLAS would be used in the Pentium III network. In table 5 the execution times are compared when using the different libraries. The lowest execution time is highlighted for each matrix and number of processors. We can see the best library is now BLASII, which is the second best in the sequential case. This may be due to the fact that in parallel we work with smaller vectors in each processor, and the library preferred with the sizes experimented in the sequential case may not be the one preferred for smaller sizes.
3.4 Lift-and-project Method for the Inverse Additive Eigenvalue Problem
A lift-and-project method for the solution of the inverse additive eigenvalue problem has been used as an example of a high level algorithm of high cost. The order of the algorithm is O(n^4), and this high order makes it easier to obtain
Table 5: Comparison of the execution time (in seconds) of the parallel routine with different basic libraries. Routine LS Toeplitz in a network of Pentium III with Myrinet.

          processors:  1        2        3        4
ATLAS     400          0.4566   0.5870   0.5280   0.4151
          800          1.0109   1.4078   1.1537   1.1023
          1200         1.9772   2.6694   2.3173   1.8172
BLASII    400          0.6115   0.5554   0.5088   0.4073
          800          1.2751   1.3887   1.1412   0.9650
          1200         2.4280   2.5838   2.1818   1.7185
BLASIII   400          0.6329   0.5850   0.5180   0.4170
          800          1.3183   1.4244   1.1862   1.0444
          1200         2.5290   2.5745   2.2229   2.0067
optimizations than in the Toeplitz Least Square algorithm considered, which has an order O(n^2).
The problem is: given a set of L + 1 real symmetric matrices A_0, A_1, ..., A_L of sizes n × n, and a set of real numbers λ*_1 ≤ λ*_2 ≤ ... ≤ λ*_m (m ≤ n), obtain d ∈ R^L and a permutation σ = (σ_1, ..., σ_m), with 1 ≤ σ_1 < ... < σ_m ≤ n, which minimizes the function F(d, σ) = (1/2) Σ_{i=1}^{m} (λ_{σ_i}(d) − λ*_i)², with λ_i(d), i = 1, 2, ..., n, the eigenvalues of the matrix A(d) = A_0 + Σ_{i=1}^{L} d_i A_i.
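The objective function F can be evaluated directly with NumPy's symmetric eigensolver. In this sketch the permutation σ is fixed and given with 0-based indices; it is an illustration of the objective, not the lift-and-project iteration itself.

```python
import numpy as np

def F(d, sigma, A, lam_star):
    """F(d, sigma) = 1/2 * sum_i (lambda_{sigma_i}(d) - lambda*_i)^2,
    where lambda_1(d) <= ... <= lambda_n(d) are the eigenvalues of
    A(d) = A[0] + sum_i d[i] * A[i + 1]."""
    Ad = A[0] + sum(di * Ai for di, Ai in zip(d, A[1:]))
    lam = np.linalg.eigvalsh(Ad)  # eigenvalues in ascending order
    return 0.5 * sum((lam[s] - l) ** 2 for s, l in zip(sigma, lam_star))
```

Each evaluation requires a full symmetric eigendecomposition, which is where the O(n^3) per-iteration work, and hence the O(n^4) overall cost, comes from.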
The lift-and-project method used to solve the Least Square problem has been that described in [23]. We analyze the different parts (TRACE, ADK, EIGEN, MATEIG, MATMAT and ZKA0A) to identify the lower level routines used in the solution of the problem, and the cost of the different parts of the algorithm. The theoretical model obtained is:
iter \left( \frac{22}{3} k_{syev} + 2 k_{3,gemm} + k_{3,diaggemm} \right) n^3 + iter \left( 2 k_{1,dot} + k_{1,scal} + k_{1,axpy} \right) n^2 L + 2 k_{1,dot} n^2 L^2 + k_{sum} n L^2 \qquad (5)

where iter represents the number of iterations needed to reach convergence.

In the formula there appear five parameters of routines from the lowest level of the hierarchy (BLAS), and one from the middle level (LAPACK). Three of the BLAS parameters correspond to operations of BLAS 1, and two are from BLAS 3, but of the same operation (gemm) with two different types of matrices.
Table 6 shows the time obtained for a 250 × 250 matrix and with L = 25. The total time and the time in each part of the algorithm appear in the table. The combinations of libraries used are: LAPACK and BLAS installed in the system and supposedly optimized for the machine (LaIn+BlIn), reference LAPACK and a freely available BLAS for Pentium III (LaRe+BIII), reference LAPACK and a freely available BLAS for Pentium II (LaRe+BlII), reference LAPACK and the installed BLAS (LaRe+BlIn), LAPACK and BLAS installed for the use of threads (LaInTh+BlInTh), reference LAPACK and a freely available BLAS for Pentium II using threads (LaRe+BlIITh), and reference LAPACK and the installed BLAS which uses threads (LaRe+BlInTh). The lowest execution time when not using threads (lnt.), the lowest when using threads (lwt.), and the overall lowest execution time (low.) are shown for each part of the algorithm and for the total time. We can see the lowest time represents an important reduction with respect to the execution time obtained with the best combination (LaRe+BIII), which is not the one installed in the system. The versions that use threads are not better than those that do not.
Table 6: Comparison of the execution time (in seconds) in the different parts of the algorithm and with different basic libraries. In dual Pentium III. n = m = 250, L = 25.

libraries       TRACE  ADK    EIGEN   MATEIG  MATMAT  ZKA0A  TOTAL
LaIn+BlIn       1.69   12.86  165.81  0.94    98.79   14.22  294.32
LaRe+BIII       1.16   14.87  210.85  0.83    26.70   10.46  264.89
LaRe+BlII       1.16   15.65  255.20  0.86    10.52   10.44  293.85
LaRe+BlIn       1.69   16.41  336.49  1.21    123.73  18.03  497.59
lnt.            1.16   12.86  165.81  0.83    10.52   10.44  201.64
LaInTh+BlInTh   1.10   13.92  266.63  0.66    14.13   12.34  308.80
LaRe+BlIITh     1.16   15.68  254.34  0.79    6.66    9.99   288.66
LaRe+BlInTh     1.10   13.71  249.59  0.62    13.74   11.90  290.68
lwt.            1.10   13.71  249.59  0.62    6.66    9.99   281.70
low.            1.10   12.86  165.81  0.62    6.66    9.99   197.06
If the information for matrix sizes 150 and 250 is stored in the LIFs (and for matrix size 100 the information associated with 150 is used, and for sizes 200 and 300 that associated with 250), then the execution time obtained with the model (mod.) can be compared with the execution time of the best library (lib.), with the time using the best library in each part of the algorithm (low.), and with the mean of the execution times of all the libraries considered (mea.). The mean value represents the execution time a non-expert user could obtain when solving the problem using some reasonable values of the parameters. The comparison is shown in table 7. The model gives a good approximation to the lowest achievable execution time, and the time with the model is clearly better than that with the best library, and much better than the mean time.
4 Conclusions and Future Work
The architecture of a polylibrary has been described, and the advantage of using polylibraries has been shown with experiments with routines of different levels in the libraries hierarchy, and in different systems. The method can be applied to
Table 7: Comparison of the execution time using the model (mod.) with the mean execution time (mea.), the time with the best library (lib.) and with the best library in each part of the algorithm (low.). In dual Pentium III. Square matrices and L = n/10.

size:  100    200     300
mea.   3.57   91.30   601.95
lib.   3.06   70.34   417.10
mod.   3.10   68.75   408.99
low.   3.02   66.54   406.11
sequential and parallel algorithms, and it can be combined with other methods of computation speed-up.
The LIF of each library will contain the cost of an operation for each one of the routines in the library. These costs may be different for different data sizes or access schemes. The main issue for the design of polylibraries is the decision of the format of the LIFs, which must be taken by the linear algebra community. This would allow the development of a hierarchy of automatic tuning libraries, with associated LIFs.
The proposed method could be applied to help in the development of efficient parallel libraries in other fields. At the moment we are researching the application to Divide and Conquer and Dynamic Programming algorithms.
References
[1] R. Whaley, A. Petitet and J. Dongarra, Automated Empirical Optimization of Software and the ATLAS Project, Parallel Computing 27 (2001) 3–25.
[2] T. Katagiri, A Study on Large Scale Eigensolvers for Distributed Memory Parallel Machines, Ph.D. Thesis, University of Tokyo, 2001.
[3] J. Cuenca, D. Gimenez and J. Gonzalez, Towards the Design of an Automatically Tuned Linear Algebra Library, in 10th EUROMICRO Workshop on Parallel, Distributed and Networked Processing PDP 2002 (IEEE, 2002) 201–208.
[4] A. Kalinov and A. Lastovetsky, Heterogeneous Distribution of Computations While Solving Linear Algebra Problems on Networks of Heterogeneous Computers, in High Performance Computing and Networking, eds. P. Sloot, M. Bubak, A. Hoekstra and B. Hetzberger, Lecture Notes in Computer Science, no. 1593 (Springer, 1999) 191–200.
[5] A. Petitet, S. Blackford, J. Dongarra, B. Ellis, G. Fagg, K. Roche and S. Vadhiyar, Numerical Libraries and the Grid, International Journal of High Performance Applications and Supercomputing 15 (2001) 359–374.
15
[6] Z. Chen, J. Dongarra, P. Luszczek and K. Roche, Self Adapting Softwarefor Numerical Linear Algebra and LAPACK for Clusters, Submitted toParallel Computing.
[7] J. Cuenca, D. Gimenez, J. Gonzalez, J. Dongarra and K. Roche, Auto-matic Optimisation of Parallel Linear Algebra Routines in Systems withVariable Load, in 11th EUROMICRO Workshop on Parallel, Distributedand Networked Processing PDP 2003, (IEEE, 2003) 409–416.
[8] E. A. Brewer, Portable High-Performance Supercomputing: High-LevelPlatform-Dependent Optimization, Ph. D. Thesis, Massachusetts Insti-tute of Technology, 1994.
[9] A. Skjellum and P. V. Bangalore, Driving Issues in Scalable Libraries:Polyalgorithms, Data Distribution Independence, Redistribution, LocalStorage Schemes, in 7th SIAM Conference on Parallel Processing for Sci-entific Computing, (SIAM, 1995) 734–737.
[10] J. J. Dongarra, J. Du Croz, I. S. Duff and S.Hammarling, A set of Level3 Basic Linear Algebra Subprograms, ACM Trans. Math. Soft. 16 (1990)1–17.
[11] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J.Du Croz, A. Greenbaum, S. Hammarling, A. McKenney and D. Sorensen,LAPACK Users’ Guide (SIAM, 1999).
[12] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon,J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walkerand R. C. Whaley, ScaLAPACK Users’ Guide, (SIAM, 1997).
[13] Greg Henry web page, www.cs.utk.edu/~ghenry.
[14] Kazushige Goto web page, www.cs.utexas.edu/users/flame/goto.
[15] R. A. van de Geijn, Using PLAPACK: Parallel Linear Algebra Package,(The MIT Press, 1997).
[16] Message Passing Interface Forum web page, www.mpi-forum.org.
[17] PVM web page, www.epm.ornl.org.gov/pvm/pvm home.html.
[18] D. Gimenez and G. Carrillo, Installation routines for linear algebra li-braries on LANs, in 4th International Meeting VECPAR2000, (Faculdadede Engenharia da Universidade do Porto, 2000) 393–398.
[19] S. S. Vadhiyar, G. E. Fagg and J. Dongarra, Automatically Tuned Collec-tive Communications, in Proceedings of the SC’2000 Conference, Dallas,Tx, Nov. 2000.
16
[20] J. Cuenca, D. Gimenez and J. Gonzalez, Modelling the Behaviour ofLinear Algebra Algorithms with Message Passing, in 9th EUROMICROWorkshop on Parallel, Distributed and Networked Processing PDP 2001,(IEEE, 2001) 282–289.
[21] T. H. Cormen, C. E. Leiserson and R. L. Rivest, Introduction to Algo-rithms, (The MIT Press, 1990).
[22] G. H. Golub and C. F. Van Loan, Matrix Computations, (The JohnsHopkins University Press, 1989).
[23] P. Alberti, Solucion Paralela de Mınimos Cuadrados para el Problema In-verso Aditivo de Valores Propios, Master’s Thesis, Universidad Politecnicade Valencia, 2002.
[24] H. Park and L. Elden, Schur-type methods for solving least squares prob-lems with Toeplitz structure, SIAM J. Sci. Comput. 22 (2000) 406–430.
[25] P. Alonso, J. M. Badıa and A. M. Vidal, A Parallel Algorithm for Solvingthe Toeplitz Least Squares Problem, in Vector and Parallel Processing-VECPAR 2000 eds. J. M. L. M. Palma, J. Dongarra V. Hernandez, LectureNotes in Computer Science, no. 1981, (Springer, 2001) 316–329.
[26] A. Alberti, A. Alonso, A. Vidal, J. Cuenca, L. P. Garcıa and D. Gimenez,Designing polylibraries to speed up linear algebra computations, TR LSI1-2003, Departamento de Lenguajes y Sistemas Informaticos, Universidadde Murcia, 2003. dis.um.es/~domingo/investigacion.html.
[27] K. Dackland and B. Kagstrom, A Hierarchical Approach for PerformanceAnalysis of ScaLAPACK-based Routines Using the Distributed Linear Al-gebra Machine, in Applied Parallel Computing: Computations in Physics,Chemistry and Engineering Science eds. Dongarra et. al. (Springer-Verlag,1996) 187–195.
[28] J. Cuenca, L. P. Garcıa, D. Gimenez, J. Gonzalez and A. Vidal, Empir-ical Modelling of Parallel Linear Algebra Routines, Submited to PPAM2003, Fifth International Conference on Parallel Processing and AppliedMathematics.
[29] Yunquan Zhang and Ying Chen, Block size selection of parallel LU andQR on PVP-based and RISC-based supercomputers, R & D Center forParallel Software, Institute of Software, Beijing, China (2003) 100–117.
[30] ATLAS web page at netlib, www.netlib.org/atlas/index.html.
17
A Parallel Implementation of Multi-domain High-order Navier-Stokes Equations Using MPI

Hui Wang 1, Yi Pan 2, Minyi Guo 1, Anu G. Bourgeois 2, Da-Ming Wei 1

1 School of Computer Science and Engineering, University of Aizu, Aizu-Wakamatsu, Fukushima 965-8580, Japan
hwang,minyi,[email protected]

2 Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Abstract
In this paper, Message Passing Interface (MPI) techniques are used to implement the high-order full 3-D Navier-Stokes equations in multi-domain applications. A two-domain interface with a five-point overlap used previously is expanded to a multi-domain computation. There are normally two approaches for this expansion. One is to break the domain into two parts through domain decomposition (say, with one overlap), then use MPI directives to further break each domain into n parts. The other is to break the domain into 2n parts with 2n − 1 overlaps. In our present effort, the latter approach is used, and finite-size overlaps are employed to exchange data between adjacent multi-dimensional sub-domains. It is general enough to parallelize the high-order full 3-D Navier-Stokes equations into multi-domain applications without adding much complexity. Results with high-order boundary treatments show consistency between multi-domain calculations and single-domain results. These experiments provide new ways of parallelizing the 3-D Navier-Stokes equations efficiently.
1 Introduction
Mathematicians and physicists believe that the phenomena of the waves following boats and the turbulent air currents following aircraft can be explained through an understanding of solutions to the Euler equations or Navier-Stokes equations. In fact, the Euler equations are simplified Navier-Stokes equations. The Navier-Stokes equations describe the motion of a fluid in two or three dimensions. These equations must be solved for an unknown velocity vector (say, u(x, t)) and pressure field (say, p(x, t)) at position x and time t.
In nonlinear fluid dynamics, the full Navier-Stokes equations are written in strong conservation form for a general time-dependent curvilinear coordinate transformation from (x, y, z) to (ξ, η, ζ) [1, 2]:

$$\frac{\partial}{\partial t}\!\left(\frac{U}{J}\right) + \frac{\partial F}{\partial \xi} + \frac{\partial G}{\partial \eta} + \frac{\partial H}{\partial \zeta} = \frac{1}{Re}\left[\frac{\partial F_v}{\partial \xi} + \frac{\partial G_v}{\partial \eta} + \frac{\partial H_v}{\partial \zeta}\right] + \frac{S}{J} \qquad (1)$$
where F, G, H are the inviscid fluxes and F_v, G_v, H_v are the viscous fluxes. Here, U = {ρ, ρu, ρv, ρw, ρE_t} is the solution vector, and the transformation Jacobian J is a function of time for dynamic meshes. S represents the components of a given, externally applied force.
Table 1: The parameters, points, and orders for different schemes in common use.
Scheme       Points and Order     Γ      a      b
Compact 6    5-point, 6th-order   1/3    14/9   1/9
Compact 4    3-point, 4th-order   1/4    3/2    0
Explicit 4   3-point, 4th-order   0      4/3    -1/3
Explicit 2   1-point, 2nd-order   0      1      0
Curvilinear coordinates are defined as those with a diagonal metric, so that the general metric has $g_{\mu\nu} \equiv \delta_{\mu}^{\nu} (h_{\mu})^2$, where $\delta_{\mu}^{\nu}$ is the Kronecker delta.
Equation (1) is simply Newton's law f = ma for a fluid element subject to the external force source term S. Although the equations were originally written down in the 19th century, no one has yet been able to find their full solutions without help from numerical computation.
In order to discretize the above equations, a finite-difference scheme is introduced [2, 3, 4, 5] because of its relative ease of formal extension to higher-order accuracy. The spatial derivative of any scalar quantity φ, such as a metric, flux component, or flow variable, is obtained in the transformed plane by solving the tridiagonal system:
$$\Gamma \phi'_{i-1} + \phi'_i + \Gamma \phi'_{i+1} = b\,\frac{\phi_{i+2} - \phi_{i-2}}{4} + a\,\frac{\phi_{i+1} - \phi_{i-1}}{2} \qquad (2)$$
where Γ, a, and b determine the spatial properties of the algorithm. Table 1 lists the values of the parameters Γ, a, and b, together with the points and orders, for schemes in common use. The dispersion characteristics and truncation error of the above schemes can be found in Reference [2].
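As a concrete check of the interior formula in Eq. (2), the Python sketch below applies the two explicit schemes of Table 1 (Γ = 0, so no tridiagonal solve is needed) to sin(x) and confirms the expected convergence orders. The grid sizes are arbitrary choices for illustration.

```python
import math

# Interior derivative formula from Eq. (2) with Gamma = 0 (explicit
# schemes), on a uniform grid of spacing h:
#   phi'_i = a (phi_{i+1} - phi_{i-1}) / (2h) + b (phi_{i+2} - phi_{i-2}) / (4h)
# Parameters (a, b) are taken from Table 1.

def explicit_derivative(phi, h, a, b, i):
    return (a * (phi[i + 1] - phi[i - 1]) / (2 * h)
            + b * (phi[i + 2] - phi[i - 2]) / (4 * h))

def max_interior_error(n, a, b):
    """Max error differentiating sin(x) on [0, 2*pi] with n points."""
    h = 2 * math.pi / (n - 1)
    x = [j * h for j in range(n)]
    phi = [math.sin(xj) for xj in x]
    return max(abs(explicit_derivative(phi, h, a, b, i) - math.cos(x[i]))
               for i in range(2, n - 2))

# Explicit 2 (a=1, b=0) vs Explicit 4 (a=4/3, b=-1/3): doubling the
# resolution should cut the error by roughly 4x and 16x respectively.
e2_coarse, e2_fine = max_interior_error(33, 1.0, 0.0), max_interior_error(65, 1.0, 0.0)
e4_coarse, e4_fine = max_interior_error(33, 4 / 3, -1 / 3), max_interior_error(65, 4 / 3, -1 / 3)
print(e2_coarse / e2_fine)   # ~4  (2nd order)
print(e4_coarse / e4_fine)   # ~16 (4th order)
```

The compact schemes (Γ ≠ 0) follow the same right-hand side but couple neighbouring derivative values, which is why they require the tridiagonal solve mentioned above.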
A high-order implicit filtering technique is introduced in order to avoid instabilities while maintaining the high accuracy of the spatial compact discretization. The filtered values $\hat\phi$ satisfy
$$\alpha_f \hat\phi_{i-1} + \hat\phi_i + \alpha_f \hat\phi_{i+1} = \sum_{n=0}^{N} \frac{a_n}{2}\left(\phi_{i+n} + \phi_{i-n}\right) \qquad (3)$$
where the parameter α_f satisfies −0.5 < α_f < 0.5. Pade-series and Fourier-series analyses are employed to derive the coefficients a_0, a_1, ..., a_N. Using curvilinear coordinates, filter formulas of up to 10th order have been developed to solve the Navier-Stokes equations.
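To illustrate Eq. (3), here is a Python sketch of the lowest-order case, assuming α_f = 0 and N = 1 with the standard second-order coefficients a_0 = a_1 = 1/2 (an assumption noted here; the higher-order coefficients come from the Pade/Fourier analysis). In this limit the filter reduces to an explicit three-point average that leaves a constant field untouched and annihilates the odd-even grid mode, which is the kind of instability the filter targets.

```python
# Second-order filter from Eq. (3), assuming alpha_f = 0 and N = 1 with
# a0 = a1 = 1/2. The implicit filter then becomes an explicit average:
#   phi_f_i = (phi_{i-1} + 2*phi_i + phi_{i+1}) / 4

def filter2(phi):
    n = len(phi)
    return [(phi[(i - 1) % n] + 2 * phi[i] + phi[(i + 1) % n]) / 4
            for i in range(n)]  # periodic end conditions

smooth = [1.0] * 8                          # a constant field
nyquist = [(-1.0) ** i for i in range(8)]   # odd-even (grid-to-grid) mode

print(filter2(smooth))   # preserved exactly: [1.0, 1.0, ..., 1.0]
print(filter2(nyquist))  # annihilated: [0.0, 0.0, ..., 0.0]
```

With α_f ≠ 0 the left-hand side couples neighbouring filtered values and a tridiagonal solve is required, exactly as for the compact derivative in Eq. (2).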
Reference [2] provides a numerical method for solving the Navier-Stokes equations. To yield highly accurate solutions, Visbal and Gaitonde use higher-order finite-difference formulas with emphasis on Pade-type operators, which are usually superior to Taylor expansions when the functions contain poles, together with a family of single-parameter filters. By developing higher-order one-sided filters, References [2, 5] obtain enhanced accuracy without any impact on stability. With non-trivial boundary conditions, highly curvilinear three-dimensional meshes, and non-linear governing equations, the formulas are suitable for practical situations. Higher-order computation, however, may increase the calculation time.
To satisfy the need for highly accurate numerical schemes without increasing the calculation time, parallel computing techniques constitute an important component of modern computational strategy. Using parallel programming methods on parallel computers, we gain access to greater memory and CPU resources not available on serial computers, and are able to solve large problems more quickly.
A two-domain interface was introduced in Reference [5] to solve the Navier-Stokes equations. With a two-domain interface, each domain first runs with its own input data. Each domain writes the data needed by the other domain to a temporary file on the hard disk, and also checks and reads the temporary file to retrieve the data output by the other domain. This manual treatment, however, has some shortcomings: it requires much more time because it must write the data to the hard disk. Because of this low-speed I/O and the limitation of memory, this method is clearly not convenient for multi-domain calculations.
In our work, parallelizing techniques using the Message Passing Interface (MPI) [6, 8, 7, 9] are used to implement the high-order full three-dimensional Navier-Stokes equations in multi-domain applications on a multiprocessor parallel system. The two-domain interface used in [5] is expanded to multi-domain computations using MPI techniques. There are two approaches for this expansion. One approach is to break the domain into two parts through domain decomposition (say, with one overlap), then use MPI directives to further break each domain into n parts. The other approach is to break it into 2n domains with 2n − 1 overlaps. The latter approach is used in this paper, and we employ finite-sized overlaps to exchange data between adjacent multi-dimensional sub-domains. Without adding much complexity, it is general enough to treat curvilinear mesh interfaces. Results show that multi-domain calculations reproduce the single-domain results with high-order boundary treatments.
2 Parallelization with MPI
Historically, there have been two main approaches to writing parallel programs. One is to use a directives-based data-parallel language; the other is explicit message passing via library calls from standard programming languages. In the latter approach, exemplified by the Message Passing Interface (MPI), it is left to the programmer to explicitly divide data and work across the processors and then manage the communication among them. This approach is very flexible.
In our program, there are two different sets of input files, one per domain. One data set is xdvc2d1.dat and xdvc2d1.in, and the other is xdvc2d2.dat and xdvc2d2.in. xdvc2d1.dat and xdvc2d2.dat are parameter files, while xdvc2d1.in and xdvc2d2.in are initialized as input metrics files. The original input subroutine DATIN is copied into two input subroutines, DATIN1 and DATIN2. All processes share data from one of the input files:
      IF ( MYID .LT. NPES / 2 ) THEN
        CALL DATIN1
      ELSE IF ( MYID .GE. NPES / 2 ) THEN
        CALL DATIN2
      ENDIF
For convenience, the size of the input metrics is the same for all processors sharing the same input file.
      IF ( MYID .LT. NPES / 2 ) THEN
        IMPI = ILDMY / ( NPES / 2 ) * MYID
      ELSE
        IMPI = ILDMY / ( NPES - NPES / 2 ) * ( MYID - NPES / 2 )
      ENDIF
Figure 1: Schematic of mesh overlap with 5 points in a two-domain calculation [5]. Mesh 1 (columns IL-5 through IL) and Mesh 2 (columns 1 through 6) share a 5-point overlap; arrows show information transfer between the meshes.
ILDMY is the size value read from the input file. In the following code, ILDMY is reset to the local size for each processor.
      IF ( MYID .EQ. NPES / 2 - 1 ) THEN
        ILDMY = ILDMY - ILDMY / ( NPES / 2 ) * ( NPES / 2 - 1 )
      ELSE IF ( MYID .EQ. NPES - 1 ) THEN
        ILDMY = ILDMY - ILDMY / ( NPES - NPES / 2 ) * ( NPES - NPES / 2 - 1 )
      ELSE IF ( MYID .LT. NPES / 2 ) THEN
        ILDMY = ILDMY / ( NPES / 2 )
      ELSE
        ILDMY = ILDMY / ( NPES - NPES / 2 )
      ENDIF
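The intent of the two Fortran fragments above (the offset IMPI and the local size ILDMY) can be mirrored in a short Python sketch; the sanity check below confirms that, for an example configuration, the per-rank slices exactly tile each domain. The NPES and ILDMY values are illustrative, not from the actual runs.

```python
# Pure-Python mirror of the Fortran partitioning logic above: NPES ranks
# split into two groups (one per input domain); within a group of p
# ranks each rank gets ildmy // p grid lines, and the last rank of the
# group absorbs the remainder. 'offset' plays the role of IMPI.

def partition(myid, npes, ildmy):
    half = npes // 2
    if myid < half:                      # first group (reads DATIN1)
        p, local = half, myid
    else:                                # second group (reads DATIN2)
        p, local = npes - half, myid - half
    offset = (ildmy // p) * local
    size = ildmy - (ildmy // p) * (p - 1) if local == p - 1 else ildmy // p
    return offset, size

# Sanity check: within each group the slices are contiguous and
# tile [0, ildmy) exactly, with no gaps or overlaps.
npes, ildmy = 5, 68   # illustrative values
for group in (range(npes // 2), range(npes // 2, npes)):
    slices = [partition(r, npes, ildmy) for r in group]
    assert slices[0][0] == 0
    assert sum(s for _, s in slices) == ildmy
    for (o1, s1), (o2, _) in zip(slices, slices[1:]):
        assert o1 + s1 == o2
print("slices tile both domains")
```

With npes = 5 the second group has one more rank than the first, and its last rank picks up the two leftover lines, matching the `MYID .EQ. NPES - 1` branch of the Fortran.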
All of the above modifications are made in the main program. The important part, however, is in subroutine BNDRY, where we call subroutine BNDRYMPI instead of subroutine BNDRYXD in order to process multi-domain computations using MPI techniques.
In BNDRYMPI, the communication between different processes is completed by the message sending and receiving functions.
      IF ( MYID .GT. 0 ) THEN
        CALL MPI_SEND   ! sends messages to processor (MYID - 1)
        CALL MPI_IRECV  ! receives messages from processor (MYID - 1)
        CALL MPI_WAIT ( REQUEST, STATUS, IERR )
      ENDIF

      IF ( MYID .LT. NPES - 1 ) THEN
        CALL MPI_SEND   ! sends messages to processor (MYID + 1)
        CALL MPI_IRECV  ! receives messages from processor (MYID + 1)
        CALL MPI_WAIT ( REQUEST, STATUS, IERR )
      ENDIF
The above calls allocate a communication request object and associate it with the request handle (the argument REQUEST). The request can be used later to query the status of the communication or to wait for its completion. Here nonblocking message communication is used. A nonblocking send call indicates that the system may start copying data out of the send buffer; the sender should not access any part of the send buffer after a nonblocking send operation is called, until the send completes. A nonblocking receive call indicates that the system may start writing data into the receive buffer; the receiver should not access any part of the receive buffer after a nonblocking receive operation is called, until the receive completes. We can see the configuration of this communication in the next section.
3 Multi-domain calculation
In this section we introduce the configuration for multi-domain computations. The communication between any two adjacent domains is conducted through finite-size overlaps. There is an unsteady flow generated by a convecting vortex in the left-side domain(s) and a uniform flow in the right-side domain(s). Some parameters are common to all runs: the non-dimensional vortex strength C/(U∞R) = 0.02, the domain sizes −6 < X = x/R < 6 and −6 < Y = y/R < 6, 136 × 64 uniformly spaced points with a spacing interval ΔX = ΔY = 0.41, and a time interval Δt = 0.005, where R is the nominal vortex radius.
The overlap method in Reference [5] is a successful two-domain calculation method. Therefore, we expand this method when developing the MPI Fortran version of the numerical solution of the Navier-Stokes equations.
To illustrate data transfer between the domains, details of a five-point vertical overlap are shown in Figure 1. The vortex is initially located at the center of Mesh 1 (1st domain, solid lines). When the computation begins, the vortex moves towards Mesh 2 (2nd domain, dashed lines) with increasing time; it does not stop, but passes through the interface and then convects into Mesh 2. The meshes communicate through an overlap region. In Figure 1, the overlap points are shown slightly staggered for clarity, although they are collocated. At every time step, data is exchanged between domains at the end of each sub-iteration of the implicit scheme, as well as after each application of the filter. Each vertical line is denoted by its i-index. The data in columns IL − 4 and IL − 3 of the 1st domain are sent to columns 1 and 2 of the 2nd domain, and the data in columns 4 and 5 of the 2nd domain are sent to columns IL − 1 and IL of the 1st domain.
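The column traffic just described can be simulated serially. In the following Python sketch, plain dictionaries stand in for the MPI buffers (an assumption for illustration), and the assertions confirm that after one exchange all five collocated overlap columns agree, while the middle column never has to be sent. The field values are arbitrary stand-ins.

```python
# Serial simulation of the 5-point overlap exchange: mesh 1 holds
# global columns 1..IL, mesh 2 holds global columns IL-4 .. 2*IL-5,
# so the two meshes share five collocated columns.
IL = 10
mesh1 = {i: float(i) for i in range(1, IL + 1)}
mesh2 = {i: float(i + IL - 5) for i in range(1, IL + 1)}

# Pretend the halo columns are stale (e.g. from a previous sub-iteration).
mesh1[IL - 1] = mesh1[IL] = 0.0
mesh2[1] = mesh2[2] = 0.0

def exchange(m1, m2):
    """The 5-point overlap exchange of Figure 1."""
    m2[1], m2[2] = m1[IL - 4], m1[IL - 3]   # mesh 1 -> mesh 2
    m1[IL - 1], m1[IL] = m2[4], m2[5]       # mesh 2 -> mesh 1

exchange(mesh1, mesh2)
# Every collocated overlap pair now agrees. The middle column (IL-2 in
# mesh 1, 3 in mesh 2) is the non-communicating overlap point: each
# mesh computes it locally, so it never needs to be transferred.
for i in range(5):
    assert mesh1[IL - 4 + i] == mesh2[1 + i]
print("5-point overlap consistent")
```

In the real code each "column" is a full 2-D slab of the 3-D field, and the two assignments in `exchange` become the MPI_SEND/MPI_IRECV pairs shown earlier.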
Figure 2: The sketch map of sending and receiving messages among different processes for single-domain, two-domain, and multi-domain computations.
Figure 3: The swirl velocity as a function of x, obtained in a two-domain calculation with periodic end conditions, compared at three instants: T = 0 (upper left), 6 (upper right), and 12 (bottom). Different colors are used to separate the velocities of the set IL = 34 + 35, JL = 34 on different processors.
Multi-domain calculations are performed using exactly the same principle as single-domain computations. The only difference is that data are exchanged between adjacent domains at the end of each scheme and after each filtering.
Expanding the configuration from two domains to multiple domains, these transfers take place between each pair of adjacent domains (processors). The computation starts with the vortex located at the center of Mesh i. With increasing time, the vortex convects into Mesh i + 1 after passing through the interface. The meshes communicate through an overlap region. Each mesh has IL vertical lines, denoted by their i-index, and JL horizontal lines. The values at points 1 and 2 of Mesh 2 are equal to the corresponding values at points IL − 4 and IL − 3 of Mesh 1. Similarly, values at points 4 and 5 of Mesh 2 are sent to points IL − 1 and IL of Mesh 1. The exchange of information is shown with arrows at each point in the schematic. Note that although the points on line IL − 2 (Mesh 1) and line 3 (Mesh 2) also overlap, they do not need to communicate directly. This occurs only when the number of overlap points is odd (say, 5 here), and such a point is called the non-communicating overlap (NCO) point. Figure 2 shows the sketch of sending and receiving messages among the different processes in our program. A vertical interface is introduced into the curvilinear mesh as shown schematically.
Figure 4: Running time and speedup for the set IL = 34 + 35, JL = 34, the set IL = 136 × 2, JL = 64, and the set IL = 68 × 2, JL = 64, versus the number of processors.
The number of overlap points per line can be larger or smaller, but there must be at least two points. Reference [5] investigated odd numbers of overlap points, specifically 3, 5, and 7 points, and the effect of overlap size. Increasing the overlap size also increases the instabilities, and the vortex is diffused and distorted. In a 3-point overlap, each domain receives data only at an end point, which is not filtered. With larger overlaps, every domain needs to exchange several near-boundary points, which must be filtered owing to the character of the Pade-type formulas. However, filtering the received data can actually introduce instabilities at domain interfaces. On the other hand, results show that accuracy may increase with a larger overlapping region: the 7-point overlap calculation, as opposed to the 3-point overlap calculation, can reproduce the single-domain calculation. As a balance, schemes with a 5-point horizontal overlap perform well at reproducing the results of the single-domain calculation while keeping the vortex in its original shape.
Figure 5: Swirl velocities of the multi-domain calculation for the set with IL = 136 × 2 and JL = 64 at T = 12, shown as a function of the number of processors (2, 5, 6, and 10 CPUs).
When the specification of the boundary treatment is important, the precise scheme is denoted with 5 numbers, abcde, where a through e denote the order of accuracy of the formulas employed at points 1, 2, 3 · · · IL − 2, IL − 1, and IL, respectively. For example, 45654 consists of a sixth-order interior compact formula that degrades to 5th- and 4th-order accuracy near each boundary. Explicit schemes in this 5-number notation are prefixed with an E. The curvilinear mesh in this paper is generated with the parameters xmin = ymin = -6, xmax = 18, ymax = 6, Ax = 1, Ay = 4, nx = ny = 4, and φx = φy = 0. The 34443 scheme with a 5-point overlap is used to generate the results.
4 Performance
All of the experiments were performed on an SGI Origin 2000 distributed-shared-memory multiprocessor system with 24 R10000 processors. Each processor has a 200 MHz CPU with 4 MB of L2 cache and 2 GB of main memory. The system runs IRIX 6.4.
Beginning with a theoretical initial condition, the swirl velocity, $\sqrt{(u - u_\infty)^2 + v^2}$, obtained in the two-domain calculation with periodic end conditions is compared at three instants: T = 0 (upper-left panel), 6 (upper-right panel), and 12 (bottom panel). At time T = 0, the vortex center is located at X = 0 and, as time increases, it convects to the right at a non-dimensional velocity of unity. The two-domain results are clearly demonstrated in Figure 3, which shows the swirl velocity along Y = 0 at T = 12. The discrepancy in peak swirl velocity between the single- and two-domain computations is negligible. The vortex is well behaved as it moves through the interface, and the contour lines are continuous at the non-communicating overlap point. This shows that the two-domain results reproduce the single-domain values quite well, and indicates that the higher-order interface treatments are stable and recover the single-domain solution.
The first serious issue we confronted was the correctness of our computation. Comparing the two-domain computation with the manual treatment of Reference [5], the results produced by the MPI parallel code were found to be in excellent agreement with those in Reference [5]; there is no difference between them.
Speedup figures describe the relationship between cluster size and execution time for a parallel program. If a program has run before (perhaps on a range of cluster sizes), we can record past execution times and use a speedup model to summarize the historical data and estimate future execution times. Here, the speedup for the IL = 136 × 2, JL = 64 set versus the number of processors is shown in Figure 4 for the MPI-based parallel calculations. From the figure, we can see that the speedup is close to 8 for 10 CPUs. The reason is that, in the five-point overlap treatment, the computation on each processor is almost independent, while the cost of communication between the processors can be disregarded. The speedup is thus close to a linear function of the number of processors, which shows that our parallel MPI code partitions the work well.
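The speedup model mentioned above can be sketched as follows. Since the two-input-file code cannot run on one CPU, the sketch takes the 2-CPU run as the baseline and treats p · T(p) at p = 2 as the implied serial time (an assumption of this sketch); the timings themselves are hypothetical, chosen only to resemble near-linear scaling like that reported here.

```python
# Speedup relative to an implied serial time. Because the code cannot
# run on a single CPU, we assume T(1) ~ base_p * T(base_p) with base_p
# the smallest processor count that was actually run.

def speedup(timings, base_p=2):
    """timings: {processors: wall time in seconds} -> {processors: speedup}."""
    t1 = base_p * timings[base_p]        # implied serial time (assumption)
    return {p: round(t1 / timings[p], 1) for p in sorted(timings)}

timings = {2: 400.0, 5: 165.0, 10: 100.0}   # hypothetical seconds
print(speedup(timings))   # {2: 2.0, 5: 4.8, 10: 8.0}
```

A near-linear curve (here speedup 8 on 10 CPUs) is consistent with computation dominating communication, as argued above for the five-point overlap treatment.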
The single-domain computation with the same conditions is also included in Figure 4. Since two different input files exist, we cannot use the MPI-based parallel code with only 1 CPU (processor); this restriction could be removed later if a single input file were used. On the other hand, the maximum number of processors is limited by the number of horizontal points, IL. Suppose we have a set with IL = 34: more than 6 processors may produce error messages, because each domain must contain at least about 5 points to exchange with its adjacent domain.
Results of multi-domain calculations for the set with IL = 136 × 2, JL = 64 are shown in Figure 5. In the figure, the swirl velocities at T = 12 are shown as a function of the number of processors. With different numbers of processors, the shapes of the swirl velocities are very close. This shows that our multi-domain calculation on a multiprocessor system is consistent with the single-domain results.
5 Conclusion
In this paper, the multi-domain high-order full 3-D Navier-Stokes equations were studied. MPI was used to treat the multi-domain computation, which is expanded from the two-domain interface used previously. A single domain is broken into up to 2n domains with 2n − 1 overlaps, and five-point finite-sized overlaps are employed to exchange data between adjacent multi-dimensional sub-domains. The approach is general enough to parallelize the high-order full 3-D Navier-Stokes equations into multi-domain applications without adding much complexity. Parallel execution results with high-order boundary treatments show strong agreement between multi-domain calculations and the single-domain results. These experiments provide new ways of parallelizing the 3-D Navier-Stokes equations efficiently.
References
[1] D. A. Anderson, J. C. Tannehill, and R. H. Pletcher, Computational Fluid Mechanics and Heat Transfer, McGraw-Hill Book Company, 1984.

[2] M. R. Visbal and D. V. Gaitonde, High-Order Accurate Methods for Unsteady Vortical Flows on Curvilinear Meshes, AIAA Paper 98-0131, January 1998.

[3] S. K. Lele, Compact Finite Difference Schemes with Spectral-like Resolution, Journal of Computational Physics, 103:16-42, 1992.

[4] D. V. Gaitonde, J. S. Shang, and J. L. Young, Practical Aspects of High-Order Accurate Finite-Volume Schemes for Electromagnetics, AIAA Paper 97-0363, Jan. 1997.

[5] D. V. Gaitonde and M. R. Visbal, Further Development of a Navier-Stokes Solution Procedure Based on Higher-Order Formulas, AIAA Paper 99-0557, 37th AIAA Aerospace Sciences Meeting & Exhibit, Reno, NV, Jan. 1999.

[6] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, Massachusetts, 1994.

[7] P. Pacheco, Parallel Programming with MPI, Morgan Kaufmann, San Francisco, California, 1997.

[8] M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra, MPI: The Complete Reference, MIT Press, Cambridge, Massachusetts, 1996.

[9] B. Wilkinson and M. Allen, Parallel Programming, Prentice Hall, Upper Saddle River, New Jersey, 1999.
Developing an Object-oriented Parallel Iterative-Methods Library

Chakib Ouarraoui and David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
ochakib,[email protected]
Abstract
In this paper we describe our work developing an object-oriented parallel toolkit on OOMPI. We are developing a parallelized implementation of the Multi-view Tomography (MVT) Toolbox, an iterative solver for a range of tomography problems. The code is developed in object-oriented C++. The MVT Toolbox is presently used by researchers in the field of tomography to solve linear and non-linear forward modeling problems. This paper describes our experience parallelizing an object-oriented numerical library comprising a number of iterative algorithms. In this work, we present a parallel version of the BiCGSTAB algorithm and the block-ILU preconditioner. These two algorithms are implemented in C++ using OOMPI (Object-Oriented MPI) and run on a 32-node Beowulf cluster. We also demonstrate the importance of using threads to overlap communication and computation as an effective path to improved speedup.

Keywords: parallel computing, OOMPI, multi-view tomography, iterative methods.
1 Introduction
Developing object-oriented programs in a parallel programming environment is an emerging area of research. As we begin to utilize an object-oriented programming paradigm, we need to rethink how we develop libraries in order to maintain the encapsulation and polymorphism provided by the language. In this paper we develop parallel versions of methods associated with the IML++ library [3]. Our goal is an efficient implementation of the library that can run on a Beowulf cluster. The IML++ library is C++ code that contains a number of iterative algorithms, such as Biconjugate Gradient, Generalized Minimal Residual, and Quasi-Minimal Residual. In order to develop a parallel version of the IML++ library, each individual algorithm needs to be parallelized.
To develop a parallel implementation of an object-oriented library, we need middleware that works seamlessly with an object-oriented programming paradigm. OOMPI is an object-oriented message passing interface [7]. It provides a class library that encapsulates the functionality of MPI into a functional class hierarchy, providing a simple and flexible interface. We use OOMPI as we attempt to parallelize the MVT Toolbox on a Beowulf cluster.
This paper is organized as follows. In Section 2 we describe the Multi-view Tomography Toolbox. In Section 3 we discuss the design of the IML++ library. In Section 4 we go into the details of the parallel BiCGSTAB algorithm and the ILU preconditioner. In Section 5 we discuss the issue of scalability as a function of the amount of communication inherent in the code. Finally, in Section 6 we summarize the paper and discuss directions for future work.
2 MVT Toolbox
The MVT Toolbox provides researchers in the many tomography fields with a collection of generic tools for solving linear and non-linear forward modeling problems [6]. The toolbox is designed for flexibility in establishing parameters, providing the ability to tailor the tool to specific tomography problem domains. The library provides a numerical solution to partial differential equations of the form
$$\nabla \cdot \left(\sigma(\mathbf{r})\, \nabla \phi(\mathbf{r})\right) + k^2(\mathbf{r})\,\phi(\mathbf{r}) = s(\mathbf{r}), \qquad (1)$$
where r is a point in 3-space; σ(r) is the conductivity, which can be any real-valued function of space; k²(r) is the wave number, which can be any complex-valued function of space; φ(r) is the field for which we are solving; s(r) is the source function; and the boundary conditions can be Dirichlet, Neumann, or mixed.
The MVT Toolbox is designed as a numerical solver for problems of the general form defined by Equation 1. The Toolbox can solve this problem on a regular or semi-regular Cartesian grid. It has the ability to model arbitrary sources as well as receivers. It can also use a finite-difference approach for discretization of the equation defined above and the associated boundary conditions in 3-D space. In this case, the problem of solving the differential equation for the field is reduced to solving a large sparse system of linear equations of the general form Af = s, where the matrix A represents the finite-difference discretization of the differential operator as well as the boundary conditions. The vector f contains the sampled values of the field at the grid points, and the vector s holds the discretized source function. The IML++ package is used as the main solver for this system.
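A one-dimensional analogue makes the reduction to Af = s concrete. The Python sketch below (an illustration, not MVT code) discretizes φ″ + k²φ = s with σ = 1 and homogeneous Dirichlet boundaries, then solves the resulting tridiagonal system with the Thomas algorithm. A quadratic manufactured solution is reproduced exactly, because second-order differences are exact for quadratics.

```python
# 1-D sketch of the finite-difference reduction described above:
# sigma = 1 and Dirichlet boundaries turn phi'' + k^2 phi = s into a
# tridiagonal system A f = s, solved here with the Thomas algorithm.

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are unused."""
    n = len(diag)
    d, r = diag[:], rhs[:]
    for i in range(1, n):                 # forward elimination
        w = lower[i] / d[i - 1]
        d[i] -= w * upper[i - 1]
        r[i] -= w * r[i - 1]
    f = [0.0] * n
    f[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        f[i] = (r[i] - upper[i] * f[i + 1]) / d[i]
    return f

def solve_helmholtz_1d(n, k2, source, a=0.0, b=1.0):
    """Solve phi'' + k2*phi = s on (a, b) with phi(a) = phi(b) = 0."""
    h = (b - a) / (n + 1)
    x = [a + (i + 1) * h for i in range(n)]      # interior grid points
    diag = [-2.0 / h**2 + k2] * n
    off = [1.0 / h**2] * n
    rhs = [source(xi) for xi in x]
    return x, solve_tridiagonal(off, diag, off, rhs)

# Manufactured solution phi = x(1-x), so phi'' = -2, s = -2, k2 = 0.
x, f = solve_helmholtz_1d(31, 0.0, lambda xi: -2.0)
print(max(abs(fi - xi * (1 - xi)) for xi, fi in zip(x, f)))  # ~0
```

In 3-D the matrix is no longer tridiagonal but banded and sparse, which is exactly why the iterative Krylov methods of IML++ are used instead of a direct solve.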
3 IML++ Library
The IML++ package [3] is a C++ implementation of a set of iterative method routines. IML++ was developed as part of the Templates Project [2]. The package relies on the user to supply matrix, vector, and preconditioner classes that implement specific operations. The IML++ library includes a number of iterative methods that support the most commonly used sparse data formats. The library is highly templatized so that it can be used with different sparse matrix formats, together with a range of data formats. The library is described in [2], and contains the following iterative methods:
• Conjugate Gradient (CG)
• Generalized Minimal Residual (GMRES)
• Minimum Residual (MINRES)
• Quasi-Minimal Residual (QMR)
• Conjugate Gradient Squared (CGS)
• Biconjugate Gradient (BiCG)
• Biconjugate Gradient Stabilized (Bi-CGSTAB)
4 Parallel MVT
To develop a parallel implementation of the MVT Toolbox, we have utilized a profile-guided approach. The main advantage of this technique is that there is no need to delve into the details of the library (source code for the library may not be available). Profiling allows us to concentrate only on the hot portions of the code.
The first step in our work is to profile the serial code on a uni-processor machine. We then develop a parallel version of the hot code.
4.1 Profiling the MVT Toolbox
To profile the Toolbox, we used the technique described in [1, 4]. A similar technique has been used to profile I/O accesses [8]. Using this technique, the serial code is profiled on a single-processor machine. The GNU utility gprof was used to profile the library. The results show that most of the execution time is spent in the Iterative Method Library (IML++), so we chose to focus our parallelization work on the IML++ library.
4.2 Parallelization of Object-oriented Code
The IML++ library code computes solutions for different sources and frequencies. Therefore the easiest way to parallelize the toolbox is to parallelize at a coarse grain, computing each source on a node of the cluster. However, in general, the number of modeled sources is less than the number of processors, so the available resources would be underutilized. Another option is to follow a hybrid approach where each source is solved on a group of processors. While this approach obtains good results, it still requires parallelizing the IML++ library.
4.3 Parallel IML++
To begin to parallelize the IML++ library, we need to consider the various iterative algorithms included with the package. Each individual algorithm needs to be parallelized. In this section we explore a parallel implementation of the Biconjugate Gradient Stabilized algorithm (BiCGSTAB), as well as a parallel version of the Incomplete LU factorization code.
4.3.1 BiCGSTAB Algorithm
The BiCGSTAB algorithm is a Krylov subspace iterative method for solving a large non-symmetric linear system of the form Ax = b. The algorithm extracts approximate solutions from the Krylov subspace. BiCGSTAB is similar to the BiCG algorithm except that BiCGSTAB provides faster convergence. The pseudocode for the BiCGSTAB algorithm is provided in Figure 1.
4.3.2 Parallel BiCGSTAB Algorithm
Iterative algorithms can be challenging to parallelize. In the BiCGSTAB algorithm, there is a loop-carried dependence. If we attempt to parallelize this form of the code, a synchronization barrier is required after each iteration, so most of the time will be spent synchronizing on shared access to the data. There has been related research on minimizing the number of synchronizations. Following the approach described by Yang and Brent [9], the BiCGSTAB algorithm can be restructured so that there is only one global synchronization per iteration, by redefining and splitting the equations of the original algorithm. For example, vn can be written as
1:  r0 = b − A·x0
2:  ρ0 = α0 = ω0 = 1
3:  v0 = p0 = 0
4:  for n = 1, 2, 3, ... do
5:      ρn = r0ᵀ·rn−1
6:      βn = (ρn / ρn−1)(αn−1 / ωn−1)
7:      pn = rn−1 + βn(pn−1 − ωn−1·vn−1)
8:      vn = A·pn
9:      αn = ρn / (r0ᵀ·vn)
10:     sn = rn−1 − αn·vn
11:     tn = A·sn
12:     ωn = (tnᵀ·sn) / (tnᵀ·tn)
13:     xn = xn−1 + αn·pn + ωn·sn
14:     if xn is accurate enough then
15:         STOP
16:     end if
17:     rn = sn − ωn·tn
18: end for

Figure 1: The BiCGSTAB Algorithm.
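A minimal serial sketch of the pseudocode in Figure 1 might look as follows in C++ (dense storage, no preconditioning; the IML++ routine is templated and far more general):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}

// Unpreconditioned BiCGSTAB following the pseudocode of Figure 1.
// Step numbers in the comments refer to that figure.
Vec bicgstab(const Mat& A, const Vec& b, double tol = 1e-10, int maxit = 200) {
    size_t n = b.size();
    Vec x(n, 0.0), r = b;                    // step 1: r0 = b - A*x0, x0 = 0
    Vec r0 = r, p(n, 0.0), v(n, 0.0);        // step 3
    double rho = 1.0, alpha = 1.0, omega = 1.0;  // step 2
    for (int it = 0; it < maxit; ++it) {
        double rho_new = dot(r0, r);                       // step 5
        double beta = (rho_new / rho) * (alpha / omega);   // step 6
        for (size_t i = 0; i < n; ++i)                     // step 7
            p[i] = r[i] + beta * (p[i] - omega * v[i]);
        v = matvec(A, p);                                  // step 8
        alpha = rho_new / dot(r0, v);                      // step 9
        Vec s(n);
        for (size_t i = 0; i < n; ++i) s[i] = r[i] - alpha * v[i]; // step 10
        if (std::sqrt(dot(s, s)) < tol) {       // early exit safeguard
            for (size_t i = 0; i < n; ++i) x[i] += alpha * p[i];
            break;
        }
        Vec t = matvec(A, s);                              // step 11
        omega = dot(t, s) / dot(t, t);                     // step 12
        for (size_t i = 0; i < n; ++i)                     // step 13
            x[i] += alpha * p[i] + omega * s[i];
        for (size_t i = 0; i < n; ++i) r[i] = s[i] - omega * t[i]; // step 17
        if (std::sqrt(dot(r, r)) < tol) break;             // steps 14-16
        rho = rho_new;
    }
    return x;
}
```

Note the two inner products per half-iteration (steps 5, 9 and 12): in a distributed setting each one forces a global reduction, which is exactly the synchronization cost the restructured algorithm of the next section attacks.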
vn = A·pn = A(rn−1 + βn(pn−1 − ωn−1·vn−1)) = un−1 + βn·vn−1 − βn·ωn−1·qn−1
The same transformation can be applied to all vectors so that φn = r0ᵀ·sn and τn = r0ᵀ·vn. These code transformations make the task of parallelization more straightforward. A modified version of the BiCGSTAB algorithm is presented in Figure 2. In the modified BiCGSTAB algorithm, we can clearly see that the inner products in steps 13–18 do not involve loop-carried dependencies. The vector updates in lines 10 and 11 are independent, and the vector updates in lines 21 and 22 are also independent.
4.3.3 Implementation of Parallel BiCGSTAB
IML++ is a highly templatized library that provides the user with the ability to reuse the code for different problem classes and data formats. To maintain this polymorphism when moving to a parallel environment, some modifications needed to be made to the structure of the MVT Toolbox. The parallel implementation of the BiCGSTAB algorithm requires the addition of new methods to the vector and matrix classes. These methods basically set the range for the computations. Making these changes helps to ensure the efficiency and reusability of the classes. Other options are available that still maintain the object-oriented nature of the code and would not impact the main library. One would be to cast the input vector and matrix into a newly defined class that contains the needed methods; however, this would introduce additional memory overhead. Another would be to extract the data from the input vectors and matrices, but this would add processing overhead on each access to the data.
So the option we arrived at requires that the input vector and matrix classes provide new
1:  r0 = b − A·x0;  u0 = A·r0;  f0 = Aᵀ·r0;  q0 = v0 = z0 = 0
2:  σ−1 = π0 = φ0 = τ0 = 0;  σ0 = r0ᵀ·u0;  ρ0 = α0 = ω0 = 1
3:  for n = 1, 2, 3, ... do
4:      ρn = φn−1 − ωn−1·σn−2 + ωn−1·αn−1·πn−1
5:      δn = (ρn / ρn−1)·αn−1;  βn = δn / ωn−1
6:      τn = σn−1 + βn·τn−1 − δn·πn−1
7:      αn = ρn / τn
8:      vn = un−1 + βn·vn−1 − δn·qn−1
9:      qn = A·vn
10:     sn = rn−1 − αn·vn
11:     tn = un−1 − αn·qn
12:     zn = αn·rn−1 + βn·zn−1 − αn·δn·vn−1
13:     φn = r0ᵀ·sn
14:     πn = r0ᵀ·qn
15:     γn = f0ᵀ·sn
16:     ηn = f0ᵀ·tn
17:     θn = snᵀ·tn
18:     κn = tnᵀ·tn
19:     ωn = θn / κn
20:     σn = γn − ωn·ηn
21:     rn = sn − ωn·tn
22:     xn = xn−1 + zn + ωn·sn
23:     if xn is accurate enough then
24:         STOP
25:     end if
26:     un = A·rn
27: end for

Figure 2: Parallel BiCGSTAB Algorithm.
Figure 3: Speedup of BiCGSTAB. (Plot of relative speedup, 0–2, versus number of nodes, 0–20.)
class methods that define the range for the computation. Our selection was made based on the need to keep the BiCGSTAB algorithm highly reusable.
We developed an initial parallelized version of the code using OO-MPI [7] and ran it on our 32-node Beowulf Cluster (using only 24 nodes). Our initial performance results were somewhat surprising, only obtaining a 2X speedup on a 16-node configuration (see Figure 3).
To better understand the performance bottlenecks in this code, we used method-level profiling. The goal was to identify both the hot portions of the program (based on the number of method calls) and the methods where the program spends most of its execution time. Figure 4 shows the percentage of time spent in ILU while varying the number of compute nodes. We found that the preconditioner consumes a significant amount of the execution time, so we next parallelized the ILU preconditioner code.
4.4 ILU Preconditioner
Preconditioners are used to accelerate the convergence of an algorithm. The basic idea is to find, for a linear system Ax = b, a preconditioner M that is a good approximation to A, so that the preconditioned system is much easier to solve than the original system. Mathematically, the system M⁻¹Ax = M⁻¹b can then be solved by any iterative method. The preconditioner is not always easy to find, and there are many techniques used to arrive at the best preconditioner. Incomplete factorization preconditioners are commonly used since they are suitable for any type of matrix. The pseudocode of the ILU algorithm is given in Figure 5.
4.5 Parallel ILU Preconditioner
There are a number of parallel algorithms that implement the ILU method. However, it is hard to find a parallel ILU algorithm that is scalable. Most ILU methods are sequential in nature and thus difficult to parallelize on a cluster. In this paper we utilize the block ILU algorithm described in [5].
Figure 4: Profiling of the ILU Preconditioner Code. (Plot of the percentage of time spent in ILU, up to 45%, for 1, 2, 4, 8, and 16 nodes.)
Algorithm 4.1: ILU(M, x, D)
    for i ← 1 to n
        for j ← 1 to i
            v = v + A(i, j) * r(i)
        v = x(i) − v
    for i ← n − 1 to 1
        for j ← n − 1 to i
            v = v + A(i, j) * w(i)
        v = r(i) − v
        v = v * D(i)
        w(i) = v

Figure 5: The ILU Algorithm
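The loop nests of Figure 5 correspond to a forward sweep followed by a backward sweep. As an illustrative sketch (not the Toolbox's actual code), applying a factored preconditioner M = LU to a vector x, i.e. computing w = M⁻¹x, can be written as two triangular solves, here assuming L has a unit diagonal:

```cpp
#include <cassert>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Applies a factored preconditioner M = L*U to a vector x, i.e.
// computes w = M^{-1} x, via the two triangular sweeps sketched in
// Figure 5: a forward solve with unit-lower-triangular L, then a
// backward solve with upper-triangular U.
Vec apply_ilu(const Mat& L, const Mat& U, const Vec& x) {
    int n = static_cast<int>(x.size());
    Vec y(n), w(n);
    for (int i = 0; i < n; ++i) {          // forward sweep: L y = x
        double v = x[i];
        for (int j = 0; j < i; ++j) v -= L[i][j] * y[j];
        y[i] = v;                          // L has a unit diagonal
    }
    for (int i = n - 1; i >= 0; --i) {     // backward sweep: U w = y
        double v = y[i];
        for (int j = i + 1; j < n; ++j) v -= U[i][j] * w[j];
        w[i] = v / U[i][i];
    }
    return w;
}
```

Both sweeps carry a loop dependence along i, which is why ILU application is intrinsically sequential and motivates the block variant below.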
Figure 6: Block-ILU Algorithm. (Diagram: the matrix partitioned into four blocks assigned to processors P0–P3, with overlapped regions between neighboring blocks.)
The block ILU algorithm chosen splits the matrix into blocks, assigning each block to a processor. This technique involves no synchronization between processors, which makes it potentially scalable on clusters. The matrix is divided into overlapping blocks, and each block is stored on a single processor. Figure 6 illustrates this partitioning method: the original matrix is partitioned into 4 overlapping blocks, and each block is assigned to a processor. The size of the overlap region is a controllable parameter, which helps the user search for a better solution. This size is important: a large overlap region will significantly affect the performance of the code, while a small overlap will affect the quality and the convergence of the solution.
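A sketch of this partitioning (a hypothetical helper, not the library's interface) computes, for each processor, a half-open row range extended by a controllable overlap on each side:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch of the block partitioning of Figure 6: the row
// range [0, n) is split into nblocks contiguous blocks, each extended
// by `overlap` rows on both sides (clamped to the matrix), so that
// neighboring processors share an overlapped region and the factor-
// ization needs no inter-processor synchronization.
std::vector<std::pair<int, int>> block_ranges(int n, int nblocks, int overlap) {
    std::vector<std::pair<int, int>> ranges;
    int base = n / nblocks;
    for (int p = 0; p < nblocks; ++p) {
        int lo = p * base - overlap;                                 // extend left
        int hi = (p == nblocks - 1 ? n : (p + 1) * base) + overlap;  // extend right
        if (lo < 0) lo = 0;
        if (hi > n) hi = n;
        ranges.emplace_back(lo, hi);   // half-open [lo, hi)
    }
    return ranges;
}
```

For example, 16 rows over 4 processors with overlap 1 gives the ranges [0,5), [3,9), [7,13), [11,16); widening `overlap` trades extra redundant work per processor for a preconditioner closer to the sequential ILU.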
Now that we have described the parallel BiCGSTAB and parallel ILU preconditioner, we need to run our parallel implementation on our cluster. Runtime results are shown in Figure 7. The parallelized version runs much faster and obtains better scalability. The speedup is close to 4 on an 8-processor configuration (though it degrades when additional processors are used). This is not the result we were hoping for; we hoped for linear scalability. So we continued to profile the code, looking for additional opportunities for tuning.
We next focus on the communication overhead present in the code. To obtain this profile, we inserted trace points into the code before and after each synchronization point. We obtained a profile of the total communication time and of the communication time for each synchronization point.
Looking at the results in Figure 8, we can see that the communication time for 2, 4, 8, and 16 nodes is about 20%, 30%, 35%, and 44% of the total time, respectively. Obviously, we have scalability issues here. To address this problem, we need to hide some of this communication. One solution is to overlap communication with computation. However, in our case we are using a blocking Allreduce() method, and a non-blocking version of the Allreduce() method is not provided. By using a non-blocking version we expect the speedup to be closer to 8 on a 16-processor system.
Figure 7: Speedup of the BiCGSTAB using Block-ILU. (Plot of relative speedup, 0–4, versus number of nodes, 0–20.)
Figure 8: Communication profile for the code. (Plot of the percentage of communication time, 0–100%, for 1, 2, 4, 8, and 16 nodes.)
Figure 9: Original vs. Threaded version Speedup. (Plot of relative speedup, 0–4.5, versus number of nodes, 0–20, for the original and threaded versions.)
4.6 Threaded Communication
In our work, we have found scalability issues caused by the communication overhead. A straightforward approach to address this problem is to overlap communication and computation. To implement this solution, we needed to develop a number of non-blocking functions, specifically the Allreduce() and Allgather() functions. Rewriting the Allreduce() call using Isend and Irecv is not a good solution, since all processors would need to synchronize at each processing step (i.e., too much communication). Next, we describe our solution to this problem using threads.
Our solution is to spawn a communication thread on each processor that only handles MPI calls such as Allreduce() and Allgather(). Once the communication thread is started, it goes to sleep and waits for the main process to issue a wake-up signal. The main process wakes the communication thread, which then runs in parallel with the main process while the latter performs independent computation. Once the main process is done with the computation, it waits for the communication thread to finish communicating with the other processors. This is an efficient solution, since the communication thread will most likely be sleeping, waiting for a command from the main process or for an MPI call to complete. However, there is also overhead associated with thread creation and synchronization with the main process. If the application involves a significant amount of computation, our thread-based synchronization scheme will yield higher performance. On the other hand, if the application is communication-dominated, then the main process will be stalled waiting for communication to complete.
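The scheme can be sketched with a standard C++ thread and condition variable; the callable stands in for the MPI Allreduce()/Allgather() calls, and the class name and interface are our own illustration, not the actual implementation:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Sketch of the communication-thread scheme: the thread sleeps until
// the main process signals work, performs it in parallel with the main
// computation, and the main process later waits for completion.  The
// job callable stands in for MPI calls such as Allreduce()/Allgather().
class CommThread {
    std::mutex m_;
    std::condition_variable cv_;
    std::function<void()> job_;
    bool have_job_ = false, done_ = false, stop_ = false;
    std::thread worker_;
public:
    CommThread() : worker_([this] {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return have_job_ || stop_; }); // sleep
            if (stop_) return;
            auto job = job_;
            have_job_ = false;
            lk.unlock();
            job();                        // "communication" runs here
            lk.lock();
            done_ = true;
            cv_.notify_all();             // wake the waiting main process
        }
    }) {}
    void start(std::function<void()> job) {      // wake-up signal
        std::lock_guard<std::mutex> lk(m_);
        job_ = std::move(job);
        have_job_ = true;
        done_ = false;
        cv_.notify_all();
    }
    void wait() {                                // main process joins in
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return done_; });
    }
    ~CommThread() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }
};
```

Between `start()` and `wait()` the main process performs its independent computation, which is exactly the overlap window the text describes.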
As shown in Figure 9, we obtained improved speedup for the threaded version of the parallelized code, up to 8 nodes. However, when using 16 nodes, the execution is still dominated by communication. Since the problem is divided over more nodes, each node's share of the problem is smaller, so less time is spent on computation. Also, communication overhead grows as we add computational nodes, and additional overhead for thread synchronization results.
5 Discussion
It should be clear from our results that this algorithm is not scalable on the system used in our performance study. This cluster utilizes 100 megabit/second switched (i.e., Fast) Ethernet as its communication fabric, and it is clear that communication is the bottleneck for this algorithm. We are presently moving our work to a gigabit cluster system and expect to obtain better speedup. The latency of Gigabit Ethernet is very similar to that of Fast Ethernet; they differ only in bandwidth (there would be little difference in performance if the application only exchanged small messages in the absence of message collisions). In our case, the algorithm also needs to gather a vector on all processors, so we should expect a performance increase. To estimate the total communication overhead we developed the model given in Equation 2, where T is the total time, T0 is the latency, M is the message size in bytes, and r∞ is the bandwidth.
T = T0 + M / r∞    (2)
From the algorithm, we know the size of each message and the frequency of each MPI call, so we can estimate the total time due to communication overhead. In this estimate, we assume that Allgather() and Allreduce() are executed in 4 steps on a 16-node cluster. Applying these values to our model (Equation 2) gives us 34 sec for Fast Ethernet and 3.34 sec for Gigabit Ethernet. We can clearly see that the experimental results for Fast Ethernet are very close to the results obtained from our model. This suggests that the communication overhead when we move to a gigabit network will be one-tenth of the amount seen when using Fast Ethernet, which would provide a speedup of nearly 10 on a 16-node cluster. Since gigabit-connected clusters will shortly become the de-facto network fabric, our parallel library will provide users with an effective pathway to obtain scalable performance.
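The arithmetic behind Equation 2 is easy to mechanize: the helper below evaluates T = T0 + M/r∞ for one transfer and sums it over a number of calls. The parameter values in the usage are illustrative placeholders, not the measured quantities from the paper:

```cpp
#include <cassert>
#include <cmath>

// Equation 2: the time for one message transfer is the latency T0 plus
// the message size M (bytes) divided by the bandwidth r_inf (bytes/s).
double transfer_time(double t0_sec, double bytes, double bw_bytes_per_sec) {
    return t0_sec + bytes / bw_bytes_per_sec;
}

// Summing Equation 2 over all MPI calls estimates the total
// communication overhead of the algorithm.
double total_overhead(int ncalls, double t0_sec, double bytes,
                      double bw_bytes_per_sec) {
    return ncalls * transfer_time(t0_sec, bytes, bw_bytes_per_sec);
}
```

With large messages the M/r∞ term dominates T0, so moving from Fast Ethernet (~12.5 MB/s) to Gigabit Ethernet (~125 MB/s) shrinks the estimate by roughly a factor of ten, matching the 34 sec vs. 3.34 sec figures above.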
6 Summary
In this paper we presented the parallelization of an object-oriented library that implements the BiCGSTAB and ILU algorithms. This work is still in progress.
We have shown the importance of using threads to overlap communication and computation. Our initial results on Fast Ethernet showed that these algorithms are inherently dominated by communication. However, we have studied the amount of communication in these algorithms and predict that improved scalability will be obtained on a gigabit-connected cluster. Our next steps are to develop parallel versions of the CG, BiCG, GMRES, and QMR algorithms, and then integrate these into the parallelized IML++ library. We have shown that the parallel iterative method library provides both an object-oriented interface to speed code development and the ability to obtain reasonable parallel speedup on evolving network fabrics. We plan to explore using blocked versions of some of the algorithms, which initially appears to be a promising direction for our work. We also plan to improve the design, targeting increased flexibility and reusability.
7 Acknowledgements
This work is funded by CenSSIS, the Center for Subsurface Sensing and Imaging Systems, under the Engineering Research Centers Program of the NSF (Award Number EEC-9986821).
References
[1] M. Ashouei, D. Jiang, W. Meleis, D. Kaeli, M. El-Shenawee, E. Mizan, Y. Wang, C. Rappaport, and C. Dimarzio. Profile-Based Characterization and Tuning for Subsurface Sensing and Imaging Applications. Journal of Systems, Science and Technology, pages 40–55, 2002.

[2] R. Barrett, M. Berry, T.-F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia, PA, 1994.

[3] J. Dongarra, A. Lumsdaine, X. Niu, R. Pozo, and K. Remington. A Sparse Matrix Library in C++ for High Performance Architectures. In Proceedings of the 2nd Object Oriented Numerics Conference, pages 214–218, 1994.

[4] M. El-Shenawee, C. Rappaport, D. Jiang, W. Meleis, and D. Kaeli. Electromagnetics Computations Using the MPI Parallel Implementation of the Steepest Descent Fast Multipole Method (SDFMM). ACES Journal, 17(2):112–122, 2002.

[5] P. Gonzalez, J. C. Cabaleiro, and T. F. Pena. Parallel Incomplete LU Factorization as a Preconditioner for Krylov Subspace Methods. Parallel Processing Letters, 9(4):467–474, 1999.

[6] The Multi-View Tomography Toolbox. URL: http://www.censsis.neu.edu/mvt/mvt.html, Center for Subsurface Sensing and Imaging Systems, 2003.

[7] J. Squyres, B. McCandless, and A. Lumsdaine. Object Oriented MPI: A Class Library for the Message Passing Interface. In Proceedings of the POOMA Conference, 1996.

[8] Y. Wang and D. Kaeli. Profile-Guided I/O Partitioning. In Proceedings of the International Conference on Supercomputing, pages 252–260, 2003.

[9] L. T. Yang and R. P. Brent. The Improved BiCGStab Method for Large and Sparse Unsymmetric Linear Systems on Parallel Distributed Memory Architectures. In Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP-02), pages 324–328, Beijing, 2002.
A SUPER-PROGRAMMING TECHNIQUE FOR LARGE SPARSE MATRIX
MULTIPLICATION ON PC CLUSTERS
Dejiang Jin and Sotirios G. Ziavras
Department of Electrical and Computer Engineering
New Jersey Institute of Technology
Newark, NJ 07102
Abstract. The multiplication of large sparse matrices is a basic operation for many scientific and engineering
applications. There exist some high-performance library routines for this operation. They are often optimized
based on the target architecture. The PC cluster computing paradigm has recently emerged as a viable alternative
for high-performance, low-cost computing. In this paper, we apply our super-programming approach [24] to study
the load balance and runtime management overhead for implementing parallel large matrix multiplication on PC
clusters. For a parallel environment, it is essential to partition the entire operation into tasks and assign them to
individual processing elements. Most of the existing approaches partition the given sub-matrices based on some kind of workload estimation. For dense matrices on some architectures, estimations may be accurate. For sparse
matrices on PC clusters, however, the workloads of block operations may not necessarily depend on the size of the data. The
workloads may not be well estimated in advance. Any approach other than run-time dynamic partitioning may
degrade performance. Moreover, in a heterogeneous environment, static partitioning is NP-complete. For embedded problems, it also introduces management overhead. In this paper we adopt our super-programming
approach that partitions the entire task into medium-grain tasks that are implemented using super-instructions; the
workload of super-instructions is easy to estimate. These tasks are dynamically assigned to member computer
nodes. A node may execute more than one super-instruction. Our results prove the viability of our approach.
Keywords. PC cluster, matrix multiplication, load balancing, programming model, performance
evaluation.
1. Introduction

Matrix multiplication (MM) is an important linear algebra operation. A number of scientific and engineering
applications include this operation as a building block. Due to its fundamental importance, much effort has been devoted to studying and implementing MM [1-8]. MM has been included in several libraries [1, 6, 9-12, 14]. Some MM algorithms have been developed for parallel systems [2, 3, 5, 13, 15, 16]. In many applications, the matrices
MM algorithms have been developed for parallel systems [2, 3, 5, 13, 15, 16]. In many applications, the matrices
are sparse. Although all sparse matrices can be treated as dense matrices, the sparsity of matrices can be exploited
to gain high performance. For parallel MM, the entire task should be decomposed. This introduces various
overheads. The most important are the communication and load-balancing overheads. Partitioning matrices into sub-matrix blocks and decomposing the MM are often interdependent techniques [4, 13]. Applying special implementations for sparse matrix operations to sub-matrix blocks may improve performance, but it also makes the workloads of sub-matrix operations diverse. The information about the workload of a task that applies a sparse matrix operation to sub-matrix blocks may not be available at compile time, or even when subroutines are initialized; it is only available after these routines have executed. This increases the complexity of load balancing, so it is often ignored.
Most algorithms are optimized based on the characteristics of the target platform [2, 3, 5, 15, 16]. The PC cluster computing platform has recently emerged as a viable alternative for high-performance, low-cost computing. Generally, the PCs of a cluster have substantial resources that can be used simultaneously, but they have relatively weak communication capability. So far, they lack data-communication support as efficient as a supercomputer's; they only have communication channels implemented in software, and data communication still competes with computation for CPU resources. So we need a new programming model that better fits the PC-cluster architecture.
MM operations for sparse matrices are embedded in many host programs. The optimal matrix partitioning has to be performed at run time rather than at compile time, even if the MM algorithm adopts a "static strategy". Parameters must be optimized when the MM subroutine is launched, which introduces an additional management overhead. In heterogeneous environments, finding the optimal partitioning is an NP-complete problem [13]. How we can reduce this kind of impact must still be considered to find a high-performance solution for MM on PC clusters.
To address this problem, we use in this paper our recently introduced super-programming model (SPM) [24]. Using SPM, we analyze the performance of super-programs for MM, and discuss the effect of various strategies.
2. Super Program Development
2.1 Super Programming Model
In the super-programming model, the parallel system is modeled as a single virtual machine with multiple processing nodes (member computers) [24]. Application programs are modeled as super-programs (SPs) that consist of a collection of basic abstract tasks. These tasks are modeled as super-instructions (SIs). SIs are atomic workload units, and SPs are coded with SIs. SIs are dynamically assigned to available processing units at run time, thus rendering the task-assignment process simpler.
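The dynamic assignment can be sketched with a shared counter that workers drain at run time (a single-machine illustration with threads standing in for member computers; the names are hypothetical, not SPM's actual runtime interface):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical sketch of SPM's dynamic assignment: a shared counter of
// SI indices is drained by worker threads ("member computers").  Each
// worker claims the next unassigned SI at run time instead of receiving
// a static partition, so faster workers naturally take more SIs.
void run_super_program(int num_sis, int num_workers,
                       std::vector<int>& executed_by) {
    std::atomic<int> next{0};                 // index of next unassigned SI
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w)
        workers.emplace_back([&, w] {
            for (;;) {
                int si = next.fetch_add(1);   // claim one SI atomically
                if (si >= num_sis) return;    // no unassigned SIs remain
                executed_by[si] = w;          // "execute" the SI
            }
        });
    for (auto& t : workers) t.join();
}
```

Because assignment happens at claim time, no worker sits idle while unassigned SIs remain, which is the load-balancing property the model relies on.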
In SPM, the key elements are the super-instructions. They are high-level abstract operations that solve application problems. In assembly-language programming, each instruction in a program is indivisible and is assigned to a functional unit for execution. Similarly, in SPM each SI is also indivisible: an SI can only be assigned to and executed on a single node. At the PC executable level, SIs are routines implementing frequently used operations for the corresponding application domain. Each SI can
execute on any member computer in the PC cluster. In a heterogeneous environment, an SI has many different executables corresponding to the different computers.
The key difference of SPM compared to most other high-level programming models is that these super-instructions are designed in a workload-oriented rather than function-oriented manner. (More detailed comparisons follow in Section 4.) They have a limited rather than an arbitrary workload. An example of such an SI is "multiply a pair of matrices in which the number of elements is not more than k", where k is a design parameter. Programmers or compilers can expect that it can be implemented within a known maximum execution
time on a given type of processing unit. In this manner, the system can schedule the task efficiently, thus
minimizing the chance of member nodes being idle and reducing the overhead of parallel computing. When the degree of parallelism in the SP, measured in the number of super-instructions, is much larger than the number of nodes in the cluster, any node has little chance to be idle.
Super-instructions are basic atomic tasks with workloads that have a quite accurate upper bound. Application
programs usually consist of logic functions based on the adopted algorithm. Application programs developed
under SPM are composed of SIs. These programs can still group SIs for assignment to logical function units; we call such logical function units super-functions. A super-function achieves a logic operation with arbitrary workload, whereas an SI is workload-oriented: an SI may not achieve a complete logic operation but rather a part of a logic operation with a predefined maximum workload. In SPM, super-functions are implemented with SIs. A super-function with little workload may be completed in one SI. In the general case, however, based on the design parameters, a super-function cannot be completed in just one SI because of an SI's limited workload; a number of SIs may be required. Breaking a super-function into many SIs may increase the capability of the PC cluster to exploit the high-level parallelism in super-programs. As mentioned earlier, due to communication overheads, it is difficult for PC clusters to exploit low-level parallelism. Thus, to successfully achieve parallel computation, it is critical to sufficiently exploit high-level parallelism. Independently assigning SIs, rather than super-functions, provides more high-level parallelism. Extending the functional unit to handle multiple SIs means that the high-level parallelism is determined not only by the algorithm but also by the SI designer. Increasing the degree of high-level parallelism makes the task of balancing the workload easier.
To effectively support application portability among different computing platforms and to also balance the
workload in parallel systems, an effective instruction set architecture (ISA) should be developed for each
application domain under SPM. These operations then become part of the chosen instruction set. Application
programs should maximize the utilization of such instructions in the coding process. Then, as long as an efficient
implementation exists for each SI on given computing platforms, code portability is guaranteed and good load
balancing becomes more feasible by focusing on scheduling at this coarser SI level.
2.2 Super-Instruction Set
An effective ISA should be developed for each application domain. These super-instructions should be
frequently used operations for the corresponding application domain. Matrix multiplication is used as a basic
operation in many different applications, and the MM program is often embedded in these host programs. So the super-instruction set for MM should be a common subset of the ISAs for these application domains. Addition and multiplication of sub-matrix blocks are good candidates for super-instructions. Although matrix multiplication
inherently involves the addition and multiplication of elements of the operand matrices, it can also be achieved through the addition and multiplication of arbitrarily partitioned sub-matrices. In such a block-wise approach, the operands and products are all treated as much smaller matrices whose elements are themselves matrices. Because the computing workloads of multiplication and addition are determined by the size of the sub-matrix blocks, their workload can easily be limited by reducing the size of the blocks.
The sparsity in matrices can be exploited to reduce the workload of these operations and improve the performance of the entire program. When using specific algorithms, the workload of these operations depends on the sparsity of the sub-matrix block but is never more than that of a dense matrix.
For sparse matrices, generally, the non-zero elements may not be distributed in a uniform manner. This means the sub-matrix blocks may have different sparsity: some blocks may be sparser than the entire matrix, while others may be denser than the entire matrix or may not even have any zero element. Since the workload of multiplication implemented with an optimal algorithm for sparse matrices depends on the sparsity of the matrix, this diversity of sparsity across sub-matrix blocks makes the workload of sub-matrix multiplication vary substantially. We have chosen the following super-instructions:
1) denseToDenseMultiply(A, B, C): A is an n1 x n2 dense matrix; B is an n2 x n3 dense matrix; C is an n1 x n3 dense matrix. The functionality of this SI is C = C + A*B. n1, n2, and n3 are all not greater than a pre-determined value k1.

2) sparseTodenseMultiply(A, B, C): A is an n1 x n2 sparse matrix; B is an n2 x n3 dense matrix; C is an n1 x n3 dense matrix. The functionality of this SI is C = C + A*B. n1, n2, and n3 are all not greater than a pre-determined value k2.

3) sparseToSpareMultiply(A, B, C): A is an n1 x n2 sparse matrix; B is an n2 x n3 sparse matrix; C is an n1 x n3 potentially dense matrix. The functionality of this SI is C = C + A*B. n1, n2, and n3 are all not greater than a pre-determined value k3.
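A sketch of the first SI (dense-by-dense block multiply-accumulate) shows how the workload cap k1 bounds the work per invocation. The concrete cap value and the boolean return convention below are our own illustration, not part of the SPM specification:

```cpp
#include <cassert>
#include <vector>

using Block = std::vector<std::vector<double>>;

// Sketch of the denseToDenseMultiply super-instruction: C = C + A*B on
// dense sub-matrix blocks whose dimensions are capped by the design
// parameter k1, so the SI's workload has a known upper bound (O(k1^3)).
const int k1 = 64;  // illustrative cap; the real value is a design choice

bool denseToDenseMultiply(const Block& A, const Block& B, Block& C) {
    int n1 = static_cast<int>(A.size());      // A is n1 x n2
    int n2 = static_cast<int>(B.size());      // B is n2 x n3
    int n3 = static_cast<int>(B[0].size());
    if (n1 > k1 || n2 > k1 || n3 > k1) return false;  // exceeds SI bound
    for (int i = 0; i < n1; ++i)
        for (int k = 0; k < n2; ++k)
            for (int j = 0; j < n3; ++j)
                C[i][j] += A[i][k] * B[k][j];  // accumulate C += A*B
    return true;
}
```

The sparse variants would differ only in the block storage format and inner loops, while keeping the same capped-dimension contract.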
2.3 A Super-Program for Matrix Multiplication
We first discuss MM for distributed-memory parallel computers, focusing on an implementation for the ScaLAPACK library using 2D homogeneous grids [11]. We assume the problem C = A*B, where A, B, and C are N x N square matrices whose elements are same-size square sub-matrix blocks. We also assume that the computer system has n = p² nodes configured in a 2D square grid, and that N can be divided evenly by p so that N/p = q. The three matrices share the same layout over the 2D grid: each processor node Pr,s (0 ≤ r, s < p) holds all of the Ai,j, Bi,j, and Ci,j sub-matrix blocks where i = rq + i', j = sq + j', and 0 ≤ i', j' < q. Thus, any sub-matrix block Xi,j (for X = A, B, C) is deterministically held by node Pr,s, where r = i/q and s = j/q.
The MM algorithm then divides MM into N steps, where each step involves three phases. In the k-th step, where k = 1, 2, …, N, the three phases are:
1. Every node Pr,s that holds Ai,k (for all i = 0, 1, …, q−1) multicasts the matrix block horizontally to all nodes Pr,*, where "*" stands for any qualified number.

2. Every node Pr,s that holds Bk,j (for all j = 0, 1, …, q−1) multicasts the matrix block vertically to all nodes P*,s.

3. Every node Pr,s performs the update Ci,j = Ai,k * Bk,j + Ci,j, for all Ci,j held by the node.
Each node must have some temporary space to receive 2q matrix blocks.
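The block-to-node mapping just described (r = i/q and s = j/q, with integer division) can be written directly; the helper name is ours:

```cpp
#include <cassert>
#include <utility>

// Block-to-node mapping for the 2-D layout described above: with q
// blocks per node in each dimension, sub-matrix block X_{i,j} is held
// by node P_{r,s} with r = i/q and s = j/q (C++ integer division
// truncates, giving the required floor for non-negative indices).
std::pair<int, int> owner(int i, int j, int q) {
    return {i / q, j / q};
}
```

For example, with p = 2 and q = 4 (so N = 8), block (5, 2) lives on node P1,0. Because the mapping is deterministic, every node can locate any block's holder without a directory lookup.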
Let us now focus on the SPM implementation of MM. A super-program for MM in SPM consists of N3 SIs, namely Ii,j,k for updating Ci,j = Ai,k * Bk,j + Ci,j, for all i, j, k = 0, 1, …, N-1. If i ≠ i' or j ≠ j', then Ii,j,k and Ii',j',k' can be executed in parallel. If i = i' and j = j' but k ≠ k', however, Ii,j,k and Ii,j,k' can only be executed in sequence; since we lack a simple addition SI, they must also be executed on the same node. Under these constraints, any scheduling policy is acceptable.
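The dependence rule above amounts to a simple predicate. A sketch (the tuple representation of an SI is our assumption):

```python
# I_{i,j,k} and I_{i',j',k'} may run in parallel iff they update
# different blocks of C (i != i' or j != j'); SIs sharing (i, j) must
# run in sequence, and on the same node.

def can_run_in_parallel(si_a, si_b):
    (i, j, _k), (i2, j2, _k2) = si_a, si_b
    return i != i2 or j != j2
```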
3. Scheduling Strategies
Under the SPM model, super-programs can be implemented under various scheduling policies. Because various factors affect MM performance, we discuss the following scheduling policies.
3.1 Synchronous Policy
This policy makes the super-program for MM imitate an implementation for a distributed-memory parallel computer (DMPC). A DMPC consists of a number of nodes. Each node has its own local memory, and there is no global shared memory. Nodes communicate with each other via message passing. Computations and communications in a DMPC are globally synchronized into steps. A step is either a computation step or a communication step. In a computation step, each node performs a local arithmetic/logic operation or is idle. A computation step takes a constant amount of time for all nodes. If a node finishes its work before the time slot ends, it is idle for the remaining time.
Under this policy, we initially determine the mapping of sub-matrix blocks of the product matrix Ci,j to processing nodes. It is similar to the layout described above: each node is responsible for a part of the product matrix, but the assigned area need be neither rectangular nor contiguous. The scheduler schedules super-instructions based on the round and the mapping. In the k-th round, only the SIs Ii,j,k (for all i, j = 0, 1, …, N-1) are scheduled, so only the corresponding updates are performed in this round's computation step. When a node is available to execute another SI, the node is assigned an SI if and only if there is an unassigned SI Ii,j,k that updates a sub-matrix block mapped to the node. Otherwise, the node waits for the next round to begin.
Thus, super-instructions belonging to different rounds are never executed in parallel, while all SIs belonging to the same round are executed pseudo-synchronously; hence we call this the synchronous policy. Clearly, this scheduling policy makes the super-program implementation similar to the algorithm described in the previous section.
3.2 Static Scheduling Policy
As in the synchronous scheduling policy above, a static mapping of sub-matrix blocks of C onto processor nodes is determined statically, but the SIs are not assigned to the nodes in a synchronous manner. Once a node is available to execute another SI, the scheduler checks whether there exists a qualified candidate SI waiting for execution; the candidate must correspond to the update of a matrix block that belongs to this node. If so, no matter what round the SI belongs to, the scheduler assigns it to this node for execution. Otherwise, it leaves the node idle. Thus, a node only executes the SIs that compute sub-matrix blocks statically assigned to it. Across the nodes, these SIs are executed asynchronously.
In a more restricted (k-inner) variant, the scheduler first checks whether there exists a qualified candidate that updates the same matrix block as the SI previously run on the node (i.e., with the same i and j). If so, it schedules this SI; otherwise, it schedules any qualified SI. Thus a node computes the sub-matrix blocks of the product one by one.
3.3 Random (Dynamic) Scheduling Policy
In this policy, once a node is free, the scheduler first tries to find an unassigned SI that updates the same sub-matrix block as the previous SI run on this node (these SIs have the same i and j but different k). If it succeeds, it assigns the new SI to the node. Otherwise, it checks whether there exist sub-matrix blocks that have never been updated. If so, it randomly chooses a sub-matrix block from this set and assigns to the node an SI that updates this block.
Thus, any sub-matrix block can be assigned to any node. Once a block is assigned to a node, the node exclusively executes all the SIs that update the block. When a node needs more workload, the scheduler randomly chooses a sub-matrix block.
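The selection step of this policy might be sketched as follows. This is only an illustration; the dict/set data structures for nodes, SIs and blocks are our assumptions, and ties are broken deterministically for clarity:

```python
import random

def pick_si(node, unassigned, untouched_blocks, rng=random):
    """Pick the next SI for a free node under the random (dynamic) policy.

    node: dict whose "prev" key holds the (i, j, k) of the SI it last ran;
    unassigned: set of (i, j, k) SIs not yet assigned;
    untouched_blocks: set of (i, j) blocks of C never updated so far.
    """
    prev = node.get("prev")
    if prev is not None:
        i, j, _ = prev
        same_block = [si for si in unassigned if si[0] == i and si[1] == j]
        if same_block:                      # keep updating the same C block
            return min(same_block)
        # all SIs of that block are done; fall through to pick a new block
    if untouched_blocks:
        i, j = rng.choice(sorted(untouched_blocks))  # random fresh block
        return min(si for si in unassigned if si[0] == i and si[1] == j)
    return None                             # nothing left: node stays idle
```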
3.4 Smart (Dynamic) Scheduling with Random Seed Policy
This scheduling policy is similar to the Random (Dynamic) Scheduling Policy above. The only difference is the way a sub-matrix block is chosen. When a node needs more workload, the scheduler chooses a sub-matrix block for the node from all unassigned sub-matrix blocks based on a priority determined as follows:
1) For a node, an unassigned sub-matrix block Ci,j has high priority if there exist both a sub-matrix block Ci',j and a sub-matrix block Ci,j' among the blocks that have been assigned to the node.
2) For a node, an unassigned sub-matrix block Ci,j has medium priority if it does not have high priority and there exists either a sub-matrix block Ci',j or a sub-matrix block Ci,j' among the blocks that have been assigned to the node.
3) For a node, an unassigned sub-matrix block Ci,j has low priority if there exists neither a sub-matrix block Ci',j nor a sub-matrix block Ci,j' among the blocks that have been assigned to the node.
The scheduler always assigns a block with the highest priority to a node. If only low-priority blocks exist, a random choice is made.
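The three priority levels can be expressed as a small scoring function. A sketch (the set-based representation of a node's assigned blocks is our assumption):

```python
def priority(block, assigned):
    """Priority of an unassigned block C_{i,j} for a node: 2 = high,
    1 = medium, 0 = low, following the three rules above."""
    i, j = block
    same_column = any(j2 == j for (_i2, j2) in assigned)  # some C_{i',j}
    same_row = any(i2 == i for (i2, _j2) in assigned)     # some C_{i,j'}
    return int(same_column) + int(same_row)
```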
3.5 Smart (Dynamic) Scheduling with Static Seed Policy
This policy is very similar to the Smart Scheduling with Random Seed policy above, except that we now provide a scheme to choose the first block assigned to each node. When the scheduler initially assigns a sub-matrix block to a node, the list of blocks assigned to the node is empty and all unassigned candidate blocks have low priority. In this special case, the scheduler chooses a block based on a 2D grid arrangement, as in the static scheduling policy, rather than making a random choice.
4. Comparison With Other Parallel Programming Models
There exist two distinct popular models for parallel programming: the Message-Passing and Shared-Memory models [18, 22]. Generally, both programming models are independent of the specifics of the underlying machines. There exist tools to run programs developed under either model on any type of parallel architecture. For example, a language that follows the shared-memory model can be correctly implemented on both shared-memory architectures and share-nothing multi-computers; this is also true for a language that follows the message-passing model. For example, ZPL is an array programming language that follows the shared-memory model [23]; nevertheless, for distributed-memory implementations it employs either MPI or PVM for communications. Cilk also follows the shared-memory model [20].
The optimization of parallel application programs is done for specific architectures. For this reason, the Distributed-Shared-Memory model has been proposed; UPC follows this model [17]. It gives up the high-level abstraction of the shared-memory model and exposes the lower-level system architecture to programmers. It gives programmers the capability to exploit the distributed features of the architecture, thus increasing their responsibility for better program implementations.
An array programming language like ZPL exploits only data-level parallelism; the higher-level structure in programs is executed sequentially. POOMA is similar but follows an object-oriented style [25]: a library provides a few constructs to build logical objects and access them in a global-shared style, using data-parallel computing for high performance. To exploit higher-level parallelism, task-oriented approaches are used, where extracting the parallelism between program structures is a major task for the programmer; the runtime environment exploits this parallelism for task scheduling. A few multithreaded approaches have also been developed for mapping tasks to threads. Cilk is such an approach; it employs a C-based multithreaded language [20]. The programmer uses directives to declare the structure of the task tree. All sibling tasks are executed in parallel by multiple threads that share a common context in the shared memory. EARTH-C is also such an approach [19, 21].
Approaches that exploit data parallelism, like ZPL and POOMA, usually do not address load balancing explicitly; load balancing is done implicitly by the programmer through data partitioning. For task-oriented approaches, load balancing is normally performed at run time through a combination of assigning tasks to processors and, often, preempting and/or migrating tasks at run time, as deemed appropriate by the run-time environment. Cilk uses thread migration to balance the workload. Charm++ behaves similarly; it encapsulates working threads and relevant data into objects, and the system balances the workload by migrating these objects at run time.
Our SPM model employs load-control technology for load balancing. Only a limited number of SIs are initially assigned to processors, where each SI has a limited workload; all other SIs are left in an SI queue. When a processor completes the execution of an SI, the system automatically assigns it another SI. This mechanism guarantees that no processor is either overloaded or underloaded; thus, the workloads are balanced automatically among the processors. Our approach, similarly to Cilk, employs a greedy-utilization scheduler at run time. However, our scheduling approach is not based on processes and does not migrate threads, thus eliminating the overhead of task migration. On a shared-memory system, thread migration with Cilk has very low overhead (the implementers claim 2-6 times the overhead of a function call); however, this cost does exist. [20] suggests that a high-utilization schedule on a shared-memory system may be better than a maximum-utilization schedule. This means that if the difference in processor speeds is not large enough, migrating a thread from a slow processor to a faster processor is not better than keeping the thread on the original (slow) processor. On a shared-nothing system, the cost of thread migration is much higher than on a shared-memory system. In this case, a performance model that considers thread migration but ignores its cost cannot be expected to provide a very accurate prediction.
Generally, SPM is a non-preemptive, multi-threaded execution model. It was first tested with a data-mining
problem [24].
In the above-mentioned projects, the lifetime of threads is determined by the programmer or the compiler; in SPM, it is controlled by the run-time environment. The programmer only requests thread creation, and threads are created by the runtime system based on a load-control policy. Load-control technology has been used for many years to improve the performance of multitasking in centralized (single-processor) computer systems. Neither Cilk nor Charm++ uses any load control: when a function is invoked, the Cilk runtime environment creates a new thread and adds the load to a processor of the system no matter how busy the system is. These threads are similar to SPM SIs, but an SI under SPM is assigned to a processor only when enough resources are available. Thus, SIs never need to be migrated.
In all of these parallel programming models, the addressable data are treated similarly to data in the sequential programming model. In the message-passing model, an application uses a set of cooperating concurrent sequential processes; each process has its own private space and runs on a separate node. Except for message exchanges among them, the processes behave sequentially. In the shared-memory model, the programmer can access any non-hidden data; for example, in Cilk or UPC programmers can access all data as in ANSI C. In SPM, similarly to the shared-memory model, there is global data with a higher level of abstraction. The primary data are super-data blocks, which are pure logical objects in the object domain and are associated with a given application domain; the primary operations are carried out by super-instructions. These super-data may be mapped to large blocks of underlying data by the runtime system. But unlike in all the aforementioned models, programmers can operate on an object only as an atom, with an SI; they can never access its internal components. Thus, application programmers can only access these objects rather than lower-level block structures. This gives programmers a coarse-grain data space.
5. Performance Analysis
Let us begin with some definitions. The ideal (inherent) workload Wideal of a piece of a program is defined as the total number of ordinary instructions in its optimal sequential implementation. The ideal (inherent) performance Pideal of a computer is defined as the rate at which it executes ordinary instructions per second. The shortest execution time is T = Wideal / Pideal. The ideal (peak) performance of a cluster system with multiple computers is defined as the sum of the performance of all member computers. The (real) workload W of a parallel program run on a PC cluster is defined as the product of the overall execution time and the system (peak) performance (W = (∑Pi) * T). The workload W is definitely larger than the ideal workload Wideal; the difference between W and Wideal is the so-called overhead.
When a super-problem is implemented in parallel and the program runs on multiple nodes, the system needs to use not only its computing resources to solve the problem, but also its communication resources in order to exchange data among nodes. For example, when a super-program runs on a PC cluster, the system not only executes SIs on its member nodes but may also need to remotely load operands for the SIs; the latter increases the workload. This is the so-called communication overhead Wcomm. When the tasks distributed among the nodes are not balanced, some nodes finish their tasks before others, which means that these nodes are idle for some time. This increases the overall time for solving the problem; we call this the unbalance (or idle) overhead Widle. The scheduler also needs to carry out some computation to correctly schedule SIs; this introduces some management overhead Wman.
Thus, W = Wideal + Widle + Wcomm + Wman (4-0)
The relative overheads are widle = Widle / Wideal, wcomm = Wcomm / Wideal and wman = Wman / Wideal, respectively.
Below we build overhead models to analyze the unbalance overhead of SPM programs; we only briefly discuss the management overhead at the end of Section 5.
Under this model, a parallel program is modeled as a sequence of synchronous steps. Each step ends with a synchronization of the nodes, but no synchronization is needed inside a step. Within a step, all nodes execute as quickly as possible all tasks assigned to them in that step; then they wait for all the nodes to finish their assigned tasks. In this case, the unbalance overhead is the sum over all nodes of the idle time Ti-idle of node i multiplied by its peak performance Pi (i.e., Widle = ∑Pi * Ti-idle). In a homogeneous environment (Pi = P0), the unbalance overhead is Widle = P0 * ∑Ti-idle. For matrix multiplication, each sub-matrix can be computed independently; there is no inherent synchronization requirement, so super-programs for matrix multiplication can have only one synchronous step. A synchronous algorithm with a synchronous policy, however, may split the program into many steps requiring synchronization.
For the sake of simplicity and without loss of generality, we assume that both operands of the multiplication, A and B, are each partitioned into an N x N array of sub-matrix blocks; the product matrix C is also partitioned into an N x N array of sub-matrix blocks. Each sub-matrix block is an n1 x n2 matrix. We assume that the workload of an SI, which multiplies a pair of sub-matrices Aik * Bkj and updates the block Cij, conforms to a statistical rule R0 with average workload L0 and deviation D0. Also, the program is executed on a PC cluster with n = p2 nodes.
For the synchronous algorithm, in each step a node carries out (N/p)*(N/p) multiplications, each applied to a sub-matrix of C. Thus, in each step the expected workload of a node is
Lsyn = (N/p)*(N/p)* L0 (4-1)
and the deviation is Dsyn = (N/p)* D0. We estimate the practical maximum workload as Lmax-syn = Lsyn + Dsyn. Then, the unbalance overhead of the system in a given step is (n-1)*(Lmax-syn - Lsyn) = (n-1)* Dsyn and the total unbalance overhead of the system for the N steps is
Widle-syn = N*(n-1)* Dsyn = (n-1)/n^(1/2) * (N^2 * D0) ≈ n^(1/2) * (N^2 * D0) (4-2)
For static scheduling, each node is responsible for computing (N/p)*(N/p) sub-matrices of C, and the computation of a sub-matrix of C involves N multiplications of pairs of sub-matrices Aik * Bkj. If N is large enough, then according to the law of large numbers the expected workload of a node is Ls = (N/p)*(N/p)*N* L0 and the deviation is Ds = (N/p)*N^(1/2)* D0 = N^(3/2)/n^(1/2) * D0. We estimate the maximum workload among the p2 nodes to be Ls-max = Ls + Ds. Then, the unbalance overhead of the system is
Widle-s = (n-1)*(Ls-max - Ls) = (n-1)* Ds = (n-1)/n^(1/2) * (N^(3/2) * D0) ≈ n^(1/2) * (N^(3/2) * D0) (4-3)
For any dynamic scheduling, we assume that the unit of assignment corresponds to computing a sub-matrix of C rather than performing a single Aik * Bkj. In this case, the expected workload of an assignment is L1 = N* L0 and the deviation is D1 = N^(1/2) * D0. When the last assignment is made to a node, all other nodes are doing some computation and have no chance to be idle. If we estimate the average workload remaining on the other nodes to be half of L1, then the expected value of the unbalance overhead is
Widle-dyn = (n-1)*1/2 * L1 = (n-1)*N/2 * L0 ≈ n/2 * N * L0. (4-4)
Comparing Widle-s with Widle-syn, we find Widle-syn = N^(1/2) * Widle-s. Thus, the synchronous algorithm always has N^(1/2) times larger unbalance overhead than an asynchronous algorithm with a static scheduling policy. We also have
Widle-s / Widle-dyn = 2 * (N/n)^(1/2) * (D0/L0) (4-5)
Thus, when the deviation of the basic operation (here, the multiplication of a pair of sub-matrices) is insignificant, static scheduling has less unbalance overhead. However, if the deviation of the basic operation is significant, then when N > (n/4)*(L0/D0)^2 we have Widle-s / Widle-dyn > 1, and static scheduling has heavier unbalance overhead than any dynamic scheduling.
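The estimates (4-2)-(4-4) can be evaluated numerically to compare the policies for concrete values of N, n, L0 and D0. The following sketch transcribes the formulas above (it models the estimates, not the simulator):

```python
from math import sqrt

def w_idle_synchronous(N, n, L0, D0):
    """Eq. (4-2): N steps, per-step deviation (N/p)*D0 with p = sqrt(n)."""
    return N * (n - 1) * (N / sqrt(n)) * D0

def w_idle_static(N, n, L0, D0):
    """Eq. (4-3): one asynchronous phase, deviation N^(3/2)/sqrt(n)*D0."""
    return (n - 1) / sqrt(n) * N ** 1.5 * D0

def w_idle_dynamic(N, n, L0, D0):
    """Eq. (4-4): roughly half an assignment (N*L0) left per other node."""
    return (n - 1) / 2 * N * L0
```

For instance, with N = 32, n = 16 and D0 = L0, the static/dynamic ratio (4-5) evaluates to 2*sqrt(2) ≈ 2.83, so dynamic scheduling wins; the synchronous/static ratio is sqrt(32) ≈ 5.66, matching the discussion above.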
6. Simulation
6.1 Simulation Program
Any SI is simulated as an abstract task with a workload that is determined statically. A member computer is simulated as an abstract "Node" object with a peak-performance parameter. Each node records its accumulated idle time, the time spent loading data, and the time spent executing SIs. It also keeps track of all tasks that have been assigned to it, the next task it will finish, and the time at which it will finish that task. All nodes are kept in a queue ordered by the time at which they will finish their next task. Scheduling algorithms are implemented rather than simulated. After nodes have been assigned their initial tasks, in each step the simulator finds the next task to finish, obtaining the time at which that task will finish and the node that executes it; the system time is updated to that finish time and published to all nodes; the scheduler schedules new tasks to the node; and the node is put back into the queue if it is still active. The simulator keeps running until all SIs have been executed and all nodes have become idle. In our simulation, the static schedule was created manually in advance. Except for some parameters that were determined at run time by the scheduler, all other simulation parameters were generated in advance based on their statistical distribution features; these include the workloads of the SIs. Except in the simulation of synchronous multicast, tasks are scheduled as groups of SIs that update the same sub-matrix block of C.
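The core of such an event-driven simulator can be sketched with a priority queue of node finish times. This is a simplified illustration only: it models just workloads and node speeds, with a trivial greedy scheduler, and omits the communication, caching and per-round bookkeeping of the real simulator:

```python
import heapq

def simulate(workloads, speeds):
    """Return (makespan, per-node busy time) for greedy dynamic scheduling.

    workloads: list of task workloads (work units);
    speeds: peak performance of each node (work units per time unit).
    """
    tasks = list(workloads)           # remaining SI workloads (simple queue)
    heap = []                         # (finish_time, node_id), min-heap
    busy = [0.0] * len(speeds)
    for nid, p in enumerate(speeds):  # hand out the initial tasks
        if tasks:
            w = tasks.pop(0)
            busy[nid] += w / p
            heapq.heappush(heap, (w / p, nid))
    makespan = 0.0
    while heap:
        t, nid = heapq.heappop(heap)  # next task to finish; advance the clock
        makespan = t
        if tasks:                     # scheduler hands the node a new SI
            w = tasks.pop(0)
            busy[nid] += w / speeds[nid]
            heapq.heappush(heap, (t + w / speeds[nid], nid))
    return makespan, busy
```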
6.2. Parameters Chosen
The average time to execute an SI is 10000 time units; this is a basic parameter. Some networking parameters are chosen based on a pilot program and others are chosen arbitrarily. We choose the size of all matrices to be 32 x 32 blocks, where all elements are sparse square sub-matrices of the same dimension. We assume that the workload of an SI is uniformly distributed and that the deviation of the maximum/minimum values from the average equals the average value itself; this means that the workload is distributed uniformly between 0 and 20000.
The system runs under the CPU-bound condition. For loading a sub-matrix block remotely, the sending node requires 250 time units and the receiver takes 250 units, so the total cost of loading the two operands of an SI remotely is 10% of its execution time. In our pilot program, we found that our PC cluster system takes about 20% of CPU time to handle high-level object communication when the networking resource is fully used. Under the CPU-bound condition, the networking resource may not be fully used, so we assume that the total percentage of time to load the operands of an SI remotely is 10%.
The number n of PCs in the cluster is chosen as 1, 2, 4, 8, 16, 32, 64, 128 and 256. In the homogeneous environment, the peak performance of every node is 1 unit. In the heterogeneous environment, the relative peak performance of a node with an odd index is 3 units and that of a node with an even index is 1 unit.
For the static strategy, the static layout of sub-matrix blocks is scheduled as shown in Table 1 for 4 nodes. The left column and the top row give the indices of sub-matrix blocks; the numbers in the table are the indices of the nodes to which the sub-matrix blocks are assigned. For the other cases with a different number n of nodes in the cluster (n = 1 to 256), similar schemes are used. For Smart scheduling with a Static seed (the SS strategy), the initial task assigned to each node is chosen to be the same as for the static strategy.
Table 1 Static schedule for a cluster with 4 nodes

(a) a homogeneous environment

        0-15   15-31
0-15      0      2
15-31     1      3

(b) a heterogeneous environment

        0-11   12-19   30-31
0-15      0      1       2
15-31     0      3       2
For multicasts, the membership of nodes in a multicast group is determined statically as follows for all strategies. For multicasting the sub-matrix blocks of matrix A, all nodes located in the same row of Table 1 form a group; for multicasting the sub-matrix blocks of matrix B, all nodes located in the same column of Table 1 form a group. When a node requests a data block, all nodes in the same group receive this data block.
In our simulation, the sub-matrix blocks of A and B are cached separately. The unit of cached data is a row of sub-matrix blocks of A or a column of sub-matrix blocks of B. The size of the local cache is the number of rows of A and columns of B cached; it is chosen between 1 and 32. A size of 32 means that every node can cache 32 rows of A and 32 columns of B, and can therefore fully cache all the data it needs. All simulations are repeated 16 times with different data sets; the results are the averages over these simulation runs.
6.3. Results of Simulations
Simulation results for different scheduling algorithms in a homogeneous environment are shown in Fig. 1. Syn represents the synchronous algorithm; S represents the static scheduling algorithm; R represents the random scheduling algorithm; SS represents the smart scheduling algorithm with a static initial scheme; SR represents the smart scheduling algorithm with a random initial scheme. Except for Syn, all of them try to use both unicast and multicast operations for data communications.
The results for different scheduling algorithms in a heterogeneous environment are shown in Fig. 2.
The execution times of the simulation program in various cases are shown in Fig. 3; they are an upper bound on the management overheads. u and m represent the programs that use unicast and multicast for communication, respectively; H and R represent the programs run in a homogeneous and a heterogeneous environment, respectively.
From Fig. 1, we find that the synchronous algorithm has much higher unbalance overhead, although synchronous algorithms are widely accepted in parallel implementations of MM. For high performance, it is preferable to asynchronously update the independent sub-matrix blocks when performing MM on a PC cluster. Thus, when the workloads for updating different sub-matrix blocks are diverse, super-programs in SPM are more appropriate for performing MM in the PC-cluster environment than programs written in the synchronous fashion for a DMPC.
Fig. 1 Comparison of the unbalance overhead of super-programs with different strategies in a homogeneous environment, for PC clusters with different numbers of nodes.
This set of results also shows that dynamic scheduling algorithms always have less unbalance overhead than static scheduling algorithms, no matter whether unicast or multicast is used for data communication. The
difference in the unbalance overhead between static and dynamic scheduling depends on the average number of tasks executed per node: the more tasks a node executes (i.e., the fewer nodes in the cluster), the larger the difference. When there are only 4-8 tasks per node (cases g and h), the difference is small; in the other cases, it is significant. Thus, for large problems, dynamic strategies have an advantage in reducing the unbalance overhead. The difference in the unbalance overhead among the different (dynamic) scheduling strategies,
Fig. 2 Comparison of the unbalance overhead of super-programs with different strategies in a heterogeneous environment, for PC clusters with different numbers of nodes.
Fig. 3 Comparison of the management overhead of super-programs with different strategies:
(a) static strategy; (b) random strategy; (c) smart strategy with static initials; (d) smart strategy with random initials.
however, is very small in a homogeneous environment.
Comparing the results for a heterogeneous environment shown in Fig. 2 with the corresponding results for a homogeneous environment shown in Fig. 1, we can see that the basic features are similar, but the unbalance overhead increases. This is because in a heterogeneous environment the nodes have different peak performance: the slower nodes have a higher chance of executing the last SI, while other nodes with higher peak performance may be idle at the same time, thus wasting more computation capability. The increase for the dynamic strategies is larger than that for the static strategy, which decreases the advantage of the dynamic strategies; the critical value increases to 32 tasks per node (Fig. 2(e)). When there are only 4-8 tasks per node (cases g and h), the static strategy becomes preferable. This indicates that in a heterogeneous environment the dynamic strategies are also favorable for reducing the unbalance overhead, but they are only helpful for large problems.
From Fig. 3, we see that the execution times of the simulation programs are all less than 400 ms; in many cases they are less than 100 ms. Since the scheduling algorithms were implemented rather than simulated, the management overhead for both static and dynamic scheduling is less than 400 ms for the multiplication of two 32x32 matrices. Considering that there are 32K SIs to be executed in multiplying two 32x32 matrices, and each SI may take a few ms, the relative management overhead of SPM is very low. The average time to execute an SI depends on the size of the sub-matrix blocks, which is a design parameter. If the average time is 4 ms, the maximum management overhead is only 0.3% of the total execution time; in many cases it is less than 0.1%. Compared to the unbalance overheads, it can be ignored. Comparing the figures for the different scheduling strategies, we find that the smart scheduling strategies have slightly higher management overhead than the random strategy and the static strategy, but the difference is small. The execution times for simulating super-programs in a heterogeneous environment do not differ much from those in a homogeneous environment in any case. Thus, it is appropriate to apply these strategies in both homogeneous and heterogeneous environments.
It should be noted that, for embedded problems using the static algorithm, finding the static scheme incurs some additional management overhead. The research in [13] proves that in a heterogeneous environment finding the optimal static scheme is itself a non-trivial, NP-complete problem. A dynamic approach does not have this additional overhead; all of its management overhead is included in the implementation of the scheduling.
7. Experimental Results on a PC Cluster
All the experiments were performed on a PC cluster with eight nodes; the random/dynamic scheduling approach was followed. Each dual-processor node is equipped with AMD Athlon processors running at 1.2 GHz. Each node has 1 GB of main memory, a 64 KB Level-1 cache and a 256 KB Level-2 cache. All the nodes are connected through a switch to form an Ethernet LAN; each link has 100 Mbps bandwidth. All the PCs run Red Hat 9.0 and have the same view of the file system by sharing files via an NFS file server; all data are obtained from this file server. We used a set of synthetic sparse matrices of size 8192x8192 containing 5% non-zero elements. These matrices were partitioned into 32x32 grids of sub-matrices of size 256x256. The non-zero elements were double-precision floating-point numbers between 1.0 and 2.0, generated randomly; the locations of the non-zero elements were also chosen randomly. The super-program was developed manually using the super-instructions described in Section 2. We implemented a simple runtime environment (i.e., a virtual machine) that supports our super-programming model. The runtime environment consists of a number of virtual nodes (each residing in a PC node); super-instructions are scheduled to run on these virtual nodes. The runtime environment also caches recently used super-data blocks. The runtime environment and the super-instructions were implemented in the Java language. A special function was added to the code to collect information at runtime about the execution of SIs and the utilization of individual PCs. The execution times for matrix multiplication were extracted from this collected timing information.
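Synthetic inputs of the kind described above could be generated along the following lines. This is a sketch only (the experiments used Java, and the helper names are ours); it is shown at a reduced size so it runs quickly:

```python
import random

def make_sparse(n, density, rng):
    """Return a dict {(i, j): value} with density*n*n random non-zeros,
    double-precision values uniform in [1.0, 2.0], random positions."""
    nnz = int(density * n * n)
    entries = {}
    while len(entries) < nnz:        # re-draw on collisions
        i, j = rng.randrange(n), rng.randrange(n)
        entries[(i, j)] = rng.uniform(1.0, 2.0)
    return entries

def partition(matrix, n, block):
    """Split a sparse dict into a grid of sparse blocks of size block x block."""
    grid = {}
    for (i, j), v in matrix.items():
        bi, bj = i // block, j // block
        grid.setdefault((bi, bj), {})[(i % block, j % block)] = v
    return grid

rng = random.Random(0)
A = make_sparse(512, 0.05, rng)      # the paper uses 8192x8192 with 5% non-zeros
blocks = partition(A, 512, 64)       # the paper uses 256x256 blocks (32x32 grid)
```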
To check how many threads were needed for the system to run under the CPU-bound condition, we first ran the program on just two PCs. Each PC was allowed to launch n threads to execute n SIs simultaneously, for n = 4, 8, 12 and 16. The results are shown in Fig. 4. The execution with 4 threads has a much higher execution time, while with 8 or more threads per node the total execution times are very close. Thus, 8 threads can hide the long latency of remotely loading super-data blocks (operands). To check the speedup of the program, we ran it with n nodes, for n = 1, 2, 3, 4, 5, 6 and 8. The results are shown in Fig. 5. When using 4 threads on each node, the system may run under the delay-bound condition. Even though the speedup in this case (4T) is better than in the case (8T) where 8 threads are allowed on each node, the total performance in the former case is worse than in the latter.
Fig. 4. The effect of the number of threads per node on the performance of the super-program
Fig. 5. The execution time (a) and speedup (b) for different numbers of nodes
8. Conclusions

1. Super-programs in SPM can exploit the high-level parallelism of problems by executing coarse/medium-grain
tasks in parallel. SPM is appropriate for the parallel implementation of various algorithms.
2. The overhead of performing sparse matrix multiplication in parallel involves not only communication
overheads but also imbalance overheads. For an embedded problem, it also involves management overhead.
If the workload of block operations varies, as for sparse matrix multiplication, algorithms with
fine-level synchronization have high imbalance overhead, whereas algorithms using coarse-grain tasks usually
have less. Thus, by exploiting high-level parallelism, super-programs can execute tasks
asynchronously, which avoids high imbalance overhead. Dynamic scheduling of super-programs can
reduce the imbalance overhead further: algorithms using dynamic scheduling policies have less
imbalance overhead than those using a static scheduling policy.
3. The management overhead of a dynamic algorithm is similar among the different scheduling strategies. There is
no additional management overhead for embedded problems, whether in a homogeneous or a heterogeneous
environment; all management overheads are included in the execution of the scheduler and are insignificant.
In a heterogeneous environment, the management overhead of a static strategy for embedded problems is
expected to be high. Thus, super-programs in SPM with dynamic strategies have a great advantage on a
heterogeneous PC cluster.
References

[1] I. S. Duff, M. A. Heroux, R. Pozo, An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum, ACM Trans. on Math. Software (TOMS), Vol. 28(2), pp. 239-267, Jun. 2002.
[2] K. Li, Scalable Parallel Matrix Multiplication on Distributed Memory Parallel Computers, Proc. Int. Parallel and Distributed Processing Symp. (IPDPS), pp. 307-314, 2000.
[3] T. I. Tokic, E. I. Milovanovic, N. M. Novakovic, I. Z. Milovanovic, M. K. Slojcev, Matrix Multiplication on Non-planar Systolic Arrays, 4th Int. Conf. on Telecommunications in Modern Satellite, Cable and Broadcasting Services, Vol. 2, pp. 514-517, 1999.
[4] K. Dackland, B. Kågström, Blocked Algorithms and Software for Reduction of a Regular Matrix Pair to Generalized Schur Form, ACM Trans. on Math. Software (TOMS), Vol. 25(4), pp. 425-454, Dec. 1999.
[5] J. Choi, A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers, High Performance Computing (HPC) Asia '97, pp. 224-229, May 1997.
[6] I. S. Duff, M. Marrone, G. Radicati, C. Vittoli, Level 3 Basic Linear Algebra Subprograms for Sparse Matrices: a User-Level Interface, ACM Trans. on Math. Software (TOMS), Vol. 23(3), pp. 379-401, Sept. 1997.
[7] A. E. Yagle, Fast Algorithms for Matrix Multiplication Using Pseudo-Number-Theoretic Transforms, IEEE Trans. on Signal Processing, Vol. 43(1), pp. 71-76, Jan. 1995.
[8] H.-J. Lee, J. A. B. Fortes, Toward Data Distribution Independent Parallel Matrix Multiplication, Proc. 9th Int. Parallel Processing Symp., pp. 436-440, Apr. 1995.
[9] T. Hopkins, Renovating the Collected Algorithms from ACM, ACM Trans. on Math. Software (TOMS), Vol. 28(1), pp. 59-74, Mar. 2002.
[10] B. Kågström, P. Ling, C. Van Loan, GEMM-based Level 3 BLAS: High-performance Model Implementations and Performance Evaluation Benchmark, ACM Trans. on Math. Software (TOMS), Vol. 24(3), pp. 268-302, Sept. 1998.
[11] J. J. Dongarra, L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, R. C. Whaley, ScaLAPACK User's Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
[12] N. J. Higham, Exploiting Fast Matrix Multiplication within the Level 3 BLAS, ACM Trans. on Math. Software (TOMS), Vol. 16(4), pp. 352-368, Dec. 1990.
[13] O. Beaumont, V. Boudet, F. Rastello, Y. Robert, Matrix Multiplication on Heterogeneous Platforms, IEEE Trans. on Parallel and Distributed Systems, Vol. 12(10), pp. 1033-1051, Oct. 2001.
[14] S. Huss-Lederman, E. M. Jacobson, A. Tsao, Comparison of Scalable Parallel Matrix Multiplication Libraries, Proc. Scalable Parallel Libraries Conf., pp. 142-149, Oct. 1993.
[15] K. Li, Y. Pan, S. Q. Zheng, Fast and Processor Efficient Parallel Matrix Multiplication Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System, IEEE Trans. on Parallel and Distributed Systems, Vol. 9(8), pp. 705-720, Aug. 1998.
[16] M.-J. Noh, Y. Kim, T.-D. Han, S.-D. Kim, S.-B. Yang, Matrix Multiplications on the Memory Based Processor Array, High Performance Computing on the Information Superhighway (HPC Asia '97), pp. 377-382, May 1997.
[17] T. El-Ghazawi, F. Cantonnet, UPC Performance and Potential: A NPB Experimental Study, Proc. 2002 ACM/IEEE Conf. on Supercomputing (SC'02), pp. 1-26, Nov. 2002.
[18] R. Jin, G. Agrawal, Performance Prediction for Random Write Reductions: A Case Study in Modeling Shared Memory Programs, Proc. ACM SIGMETRICS Int. Conf. on Measurement and Modeling of Computer Systems, pp. 117-128, 2002.
[19] G. M. Zoppetti, G. Agrawal, L. Pollock, J. N. Amaral, X. Tang, G. Gao, Automatic Compiler Techniques for Thread Coarsening for Multithreaded Architectures, Proc. 14th Int. Conf. on Supercomputing, pp. 306-315, May 2000.
[20] M. A. Bender, M. O. Rabin, Scheduling Cilk Multithreaded Parallel Programs on Processors of Different Speeds, Proc. 12th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 13-21, 2000.
[21] A. Sodan, G. R. Gao, O. Maquelin, J. Schultz, X. Tian, Experiences with Non-numeric Applications on Multithreaded Architectures, Proc. 6th ACM SIGPLAN Symp. on Principles & Practice of Parallel Programming, Vol. 32(7).
[22] P. Mehra, C. H. Schulbach, J. C. Yan, A Comparison of Two Model-based Performance-prediction Techniques for Message-passing Parallel Programs, Proc. 1994 ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, Vol. 22(1), pp. 181-190, 1994.
[23] C. Lin, L. Snyder, ZPL: An Array Sublanguage, Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter, A. Nicolau, D. Padua, eds., pp. 96-114, 1993.
[24] D. Jin, S. G. Ziavras, Load-Balancing on PC Clusters With the Super-Programming Model, Workshop on Compile/Runtime Techniques for Parallel Computing (ICPP'03), Kaohsiung, Taiwan, Oct. 6-9, 2003 (to appear).
[25] S. Atlas, S. Banerjee, J. Cummings, P. J. Hinker, M. Srikant, J. V. W. Reynders, M. Tholburn, POOMA: A High Performance Distributed Simulation Environment for Scientific Applications, Supercomputing '95, 1995.
Session 3: Scientific and Engineering Applications
A Scene Partitioning Method for Global
Illumination Simulations on Clusters
M. Amor, J. R. Sanjurjo, E. J. Padron, A. P. Guerra and R. Doallo
August 30, 2003
Abstract
Global illumination simulation is highly critical for realistic image synthesis; without it the rendered images look flat and synthetic. The main problem is that the simulation for large scenes is a highly time-consuming process. In this paper we present a uniform partitioning method in order to split the scene domain into a set of disjoint subspaces which are distributed among the processors of a distributed memory system. Since polygons partially inside several subspaces have to be clipped, we have developed a new clipping algorithm to manage this situation. Memory and communication requirements have been minimized in the parallel implementation. Finally, in order to evaluate the proposed method we have used a progressive radiosity algorithm, running on two different PC clusters with modern processors and network technologies as a testbed. Good results in terms of speedup have been obtained.
keywords:
Computer Graphics, Global Illumination Method, Progressive Radiosity, Cluster Computing, Scene Partitioning
1 Introduction
Achieving a good partitioning of data is a critical issue to address in order to get efficient execution times for scientific simulations in high performance parallel programs. This task becomes even more important in distributed memory systems. In such platforms the effective distribution of data among processor memories is very important in avoiding data replication (memory constraint), minimizing communication among nodes (communication constraint) and achieving a good load balance (balance constraint).
Such a partitioning is commonly found by solving a graph partitioning problem [14]. It is thus a question of splitting the vertices of the graph into p disjoint subsets (assuming p processors), trying to balance the vertex weight across the subsets while minimizing the sum of the weights of the cut edges connecting vertices in different subsets.
Good examples of highly time-consuming scientific simulations can be found in the field of Computer Graphics, where the quest for realistic synthetic images is leading to the use of more and more complex and computationally expensive global illumination models. Nowadays, one of the most popular ones is radiosity, which, based on the real behavior of light in a closed environment, produces high-quality images. However, even the most advanced and optimized algorithms for the radiosity computation, such as progressive radiosity [3] or hierarchical radiosity [8], still have a significant computational cost and high memory requirements. Therefore, decreasing the execution time consumed when rendering large and complex scenes is needed.
In recent years several works on the parallelization of the radiosity method have been proposed for different architectures [1, 6]; the most recent works focus on distributed memory systems [13, 15, 5, 11, 10], either multicomputers or clusters, because of the good scalability and flexibility of these systems. As stated previously, it is in these kinds of machines that a good partitioning and mapping of input data onto the processors are vitally important.
In this work we have focused on an efficient partitioning algorithm using a clipping method which we have developed. We have applied this algorithm to the parallel implementation of a progressive radiosity method on distributed memory machines. The resulting method works with general polygon scenes and intends to solve some of the main problems that the existing implementations present: they were not suited for all kinds of scenes [13], or they did not use a very accurate visibility algorithm [15]. It should be remarked that the progressive radiosity method was used in this paper as a testbed for checking the benefits of our partitioning algorithm for 3D image synthesis; however, we think that similar results could be obtained using all those algorithms which work better with scenes with high visibility between closer elements and low visibility between farther ones.
The paper is organized as follows: in Section 2 we describe the partitioning method; in Section 3 we explain the clipping method used to split polygons spanning two or more subspaces; in Section 4 we present the parallel progressive radiosity algorithm resulting from the application of our partitioning method; experimental results on two different clusters are shown in Section 5 and, finally, in Section 6 some conclusions are highlighted.
2 Scene partitioning method
Among the broad range of existing partitioning techniques, we have focused on parallel static graph partitioning algorithms, since the progressive radiosity method (unlike, for instance, the hierarchical radiosity method) does not use an evolving data structure; this way it is not necessary to apply more complex dynamic graph partitioning methods. Within static techniques we have chosen geometric techniques, as we are dealing with geometric scenes and these techniques get an extremely fast partition based mainly on geometric information. Specifically, the applied algorithm is a variant of the Coordinate Nested Dissection (CND) algorithm [14], a geometric technique that attempts to minimize the boundary between the subspaces (and that way the inter-processor communications) by splitting the mesh in two halves across its longest dimension.
We have used a CND version that makes a parallel and geometric uniform partition of the input scene. It splits the scene domain into a set of disjoint subspaces of the same shape and size, and it assigns all the objects contained in a subspace to the same processing element. This kind of partitioning is computed quickly and straightforwardly, and it is convenient for using the visibility masks technique [1]. It also provides high data locality, minimizing inter-processor communication. However, this kind of partitioning tends to result in unbalanced partitions, which represents a penalty in the total execution time of the parallel algorithm.
Our implementation takes into account the following considerations:
• There are as many subspaces as processing elements and, because of the uniform partition, they must be equal in size. This means that for an odd number of processors and, therefore, of subspaces, we cannot divide a scene in two halves and then, recursively, split one of the children in two new halves (see Figure 1(a)), because the resulting subspaces would not have equal size. So the implemented algorithm is not recursive and the number of splitting planes needed to split the scene along each coordinate axis is computed in advance (see Figure 1(b)).
• Because the partitioning is uniform, the planes that divide the mesh along the same coordinate axis must be parallel and equally spaced. Furthermore, each dividing plane must split the whole domain, not only an existing subspace as in CND (see Figure 1(b)).

Figure 1: Recursive vs. a priori partitioning: (a) Recursive partitioning in three subspaces. (b) A priori partitioning in three subspaces.
The partitioning algorithm has two stages. In the first stage, which we call subspace partitioning, the input coarse scene is replicated in every processor; this fact does not represent a drawback due to the low storage requirements of the coarse scene. Next, every processor divides the scene domain, determines its subspace and computes its spatial boundaries. In the second stage, which we call subspace distribution, scene objects are distributed among processors according to the prior subspace assignment. Once the distribution is set up, each processor has a disjoint set of polygons and only performs the meshing of these polygons.
2.1 Subspace partitioning
Each processor reads the whole scene data and computes the bounding volume containing the entire scene. Next, each processor divides this volume into p uniform disjoint subspaces, where p is the number of processors. The number of voxels per dimension (nx, ny and nz) is computed from the prime factors di of p, p = d1 · d2 · ... · dm, with p being an even number. To avoid long subspaces, which could happen if the number of divisions in one dimension were much greater than in the others, the prime factors are assigned to the dimensions in the following way: initially, nx = ny = nz = 1; the factors are arranged in descending order; then the factor list is traversed and each factor di is assigned to whichever of nx, ny, nz has the lowest value. As an example, if we work with 36 processors, the sorted list of prime factors
Figure 2: Three possible cases with a polygon inside a subspace.
would be 3, 3, 2, 2 and the voxels per dimension would be nx = 3, ny = 3 and nz = 2 × 2 = 4.
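The factor-assignment procedure above can be sketched as follows (a minimal reconstruction; the function name and the factorization code are ours):

```python
def voxel_dims(p):
    # Compute the prime factors of p.
    factors, d, n = [], 2, p
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)

    # Greedily assign the factors, largest first, to whichever of
    # nx, ny, nz currently has the lowest value (Section 2.1).
    dims = [1, 1, 1]  # nx, ny, nz
    for f in sorted(factors, reverse=True):
        dims[dims.index(min(dims))] *= f
    return tuple(dims)
```

For p = 36 this reproduces the example in the text: `voxel_dims(36)` yields (3, 3, 4).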
Subspaces are assigned to processors according to their coordinates. During the execution of the parallel algorithm, processors exchange data using a protocol that will be presented in Section 4. In order to simplify the communication process, subspaces that are geometrically adjacent are assigned to neighboring processors. The assignment of subspaces is carried out as follows: the subspace with coordinates (cx, cy, cz) is assigned to the processor identified as cx + cz · nx + cy · nx · nz.
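The mapping can be written directly from the formula (a sketch; the helper name is ours):

```python
def subspace_to_rank(cx, cy, cz, nx, nz):
    # Processor id for subspace (cx, cy, cz), per Section 2.1;
    # geometrically adjacent subspaces get neighbouring ids.
    return cx + cz * nx + cy * nx * nz
```

For example, with nx = 3, nz = 2 (and ny = 2), every subspace of the 3 × 2 × 2 grid maps to a distinct id in 0..11.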
2.2 Subspace distribution
The distribution of the scene is executed, like the partitioning, in parallel by all the processing elements. First, polygons are classified based on their geometric position as totally inside, partially inside or totally outside. A polygon is totally inside a subspace if all of its vertices are inside the subspace (Figure 2(a)). A polygon is partially inside if one or more of its vertices are outside and at least one of the following conditions is true: some vertex is inside the subspace, some edge of the subspace intersects the polygon (Figure 2(b)) or, finally, some edge of the polygon intersects some face of the subspace (Figure 2(c)). Otherwise the polygon is totally outside.
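The vertex part of this classification can be sketched as below (our own illustrative code; the full test also needs the edge/face intersection checks described above):

```python
def classify_by_vertices(polygon, lo, hi):
    # Count vertices of the polygon inside the axis-aligned
    # subspace [lo, hi] (closed on all faces).
    inside = sum(
        all(lo[k] <= v[k] <= hi[k] for k in range(3))
        for v in polygon
    )
    if inside == len(polygon):
        return "totally inside"
    if inside > 0:
        return "partially inside"
    # No vertex inside: the polygon may still be partially inside
    # via the intersection tests of Figures 2(b) and 2(c).
    return "undecided"
```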
Then, polygons are processed according to the previous classification: polygons totally inside the subspace are added to the list of polygons of the subspace; polygons partially inside the subspace are clipped, adding to the list of polygons of the subspace the part inside and discarding the rest. Finally, if the polygon does not belong to the subspace, it is totally discarded and not added to the list.

Figure 3: Two instances of polygonal clipping against a subspace.
3 Clipping method
Clipping a polygon against a domain consists of cutting out from the polygon those pieces that are outside the frontiers of the domain. In this section we present a 3D extension of the Liang-Barsky algorithm [9], which clips polygons against rectangles. Four different cases must be considered:
• Interior vertices: vertices of the original polygon that are inside the subspace.

• Entering intersections: if the vertex vi is outside the subspace and the vertex vi+1 is inside the subspace, then the edge vi − vi+1 enters the subspace and produces an entering intersection in one of its faces.

• Exiting intersections: if the vertex vi is inside and the vertex vi+1 is outside, then the corresponding edge exits the subspace and produces an exiting intersection.

• Edge intersections: those produced by the intersection of one edge of the subspace with the polygon.
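For a single edge, the entering and exiting intersections can be found with a slab test in the spirit of Liang-Barsky (an illustrative sketch, not the paper's exact implementation):

```python
def edge_box_intersections(v0, v1, lo, hi, eps=1e-12):
    # Clip the segment v0 -> v1 against the axis-aligned box
    # [lo, hi]; returns (entering point, exiting point), or None
    # if the segment misses the box entirely.
    t_enter, t_exit = 0.0, 1.0
    for k in range(3):
        d = v1[k] - v0[k]
        if abs(d) < eps:
            # Segment parallel to this slab: reject if outside it.
            if v0[k] < lo[k] or v0[k] > hi[k]:
                return None
            continue
        t0, t1 = (lo[k] - v0[k]) / d, (hi[k] - v0[k]) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter, t_exit = max(t_enter, t0), min(t_exit, t1)
        if t_enter > t_exit:
            return None

    def point(t):
        return tuple(v0[k] + t * (v1[k] - v0[k]) for k in range(3))

    return point(t_enter), point(t_exit)
```

If t_enter stays at 0 (or t_exit at 1), the corresponding endpoint is already inside the box, i.e. it is an interior vertex rather than an entering or exiting intersection.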
Figure 3 shows the four types of vertices that can arise as the result of polygon clipping. In this figure vertex v3 is inside the subspace and so it is an interior vertex; the following vertex, v0, is outside the subspace, so the edge v3 − v0 crosses the subspace boundary at is0, producing an exiting intersection. The edges v0 − v1 and v1 − v2 do not create intersections because all of their vertices are exterior vertices. The edge v2 − v3 crosses the subspace boundary, making the entering intersection ie0. Since the exiting intersection is0 and the entering intersection ie0 belong to different faces of the subspace, there must be an edge intersection between them, named ia0.
In addition to taking into account the above considerations, every time a new clipping vertex is inserted in the clipping list the immediately previous and following vertices in the list must be checked to verify that they fulfill the conditions that their position implies. For that purpose we create a data structure that stores all the information concerning a vertex, as can be seen in Table 1. The field named type indicates the type of the corresponding element in the list: it can be an interior vertex, an entering intersection vertex, an exiting intersection vertex, or an edge intersection. The fields named face1 and face2 indicate the faces associated with the vertex. If the vertex is associated with just one face, the field face2 is left blank. If the vertex is interior, both face1 and face2 are left blank. The field link face stores the face which is shared with the following vertex in the list.
Vertex
Type
Associated face 1
Associated face 2
Link face

Table 1: Element of the clipping list.
3.1 Clipping algorithm
Our method consists of three stages in which the vertices involved in the clipping of a polygon are computed and stored in an ordered list. In the first stage, interior vertices and entering and exiting intersections are computed. In the second stage edge intersections are computed, and in the last stage the resulting clipped polygon is tessellated. The insertion of vertices at the correct position in the clipping list is a task that needs further processing.
In the first stage of the clipping method we traverse the list of vertices of the polygon and for each vertex vi the following tasks are carried out:

• If the vertex is inside the subspace then it is stored in the first free position in the clipping list.
Figure 4: Result of the first and second stages of the clipping method: 4(a) Clipping example. 4(b) Clipping list after the first stage. 4(c) Swapping in the auxiliary list (2nd stage of the clipping method).
• We check whether any of the six faces of the subspace intersects the edge vi − vi+1. An edge intersects at most two faces of the subspace, in which case it produces an entering and an exiting intersection. If any intersection is coincident with vi or vi+1 it is discarded. When there are two intersections, the entering one is inserted in the first free position of the clipping list, and the exiting intersection is inserted in the next position. If there is just one intersection, it is stored in the first free position in the list. For every intersection added to the list, the fields face1 and link face are initialized with the intersected face of the subspace.
Figure 4(a) shows the clipping of a polygon used to illustrate the method. The procedure begins at the vertex v0 which, as it is an interior vertex, is inserted in the first position of the list. Then, the edge v0 − v1 exits the subspace and intersects the left face, which will be set as its face1 and link face, generating the exiting intersection is0, which is inserted in the list. Vertex v1 is outside the subspace and the edge v1 − v2 does not intersect the subspace. Finally, v2 is outside the subspace, and the edge v2 − v0 enters the subspace through the right face, generating the entering intersection ie0. This intersection is inserted in the list, with the right face of the subspace as its face1 and link face. Figure 4(b) shows the clipping list at the end of the first stage.

In the second stage every edge of the subspace is tested for intersection with the polygon. If an intersection exists, it is stored in a secondary list similar to the clipping list. Then, for every element of the secondary list the following steps take place:
• The intersection of the edge is compared with the points in the clipping list to check if they match. This could happen if some vertex of the polygon belongs to one of the edges of the subspace, or if an edge of the polygon lies in the same plane as a face of the subspace (thus intersecting at the same time an edge and a face of the subspace). In these cases the intersection is discarded.

• If the intersection is not discarded, we search the list to find the position where the insertion should take place. For each element of the clipping list its field link face is tested to check if it matches any of the fields face1 or face2 of the secondary list element. In that case this edge intersection should be inserted in the clipping list right after the current intersection.

• If no position can be determined for inserting this edge intersection in the clipping list, it has to follow another edge intersection still in the secondary list. The element of the secondary list being processed is swapped with another element of the secondary list placed n positions forward. Initially, and every time an edge intersection is removed from the secondary list and inserted in the clipping list, we set n = 1. Otherwise n is incremented by one and the process is repeated with the element resulting from the swap.

• Finally, the edge intersection is inserted in the clipping list, moving the following elements one position forward. Besides, the field link face of the edge intersection is updated with the associated face different from the previous link face.
The left table in Figure 5 shows the secondary list with the edge intersections for the example in Figure 4. The order of the edge intersections in this list depends on the order in which the edges of the subspace are traversed. The right table shows the list after the swapping has been done.
In the third stage, if the resulting polygon after clipping is not a triangle, it is decomposed into triangular polygons (tessellation).
Figure 5: Insertion of an intersection during the second stage of the clipping method.
4 Parallel Algorithm
Our implementation is based on coarse-grained parallelism, so every processor computes the radiosity solution for a set of patches of the scene. The message-passing library MPI is used to manage interprocessor communication. The different phases of the parallel algorithm are described in detail in the next subsections.
4.1 Local Radiosity Computation
In each iteration every processor chooses the patch with the greatest energy to shoot, Pi (among all patches inside its local subspace and patches received from other processors/subspaces). The radiosity of the selected patch is shot and, as a result, the other patches of the local subspace, Pj, may receive some new radiosity, while the patch Pi has no energy left to shoot. Each shooting step can be viewed as multiplying the value of the radiosity to be shot by a column of the form factor matrix. Therefore, in each iteration the required form factors and the visibility values between the chosen patch and the rest of the patches of the local subspace are computed. In order to calculate the form factors we have used the analytic method called differential area to convex polygon [2]. On the other hand, for visibility determination, we have implemented an algorithm based on directional techniques [7, 12].
In addition to shooting the energy in the local subspace, the rest of the scene has to be updated too, so each patch is sent along with the local visibility information to other processors. This visibility information is encoded as a structure called a visibility mask [1, 13]. In our approach, we define frontiers as the faces separating two subspaces. A frontier is divided into n × n patches. Each of these patches has an associated boolean value. This value is 1 if the frontier patch is visible from the selected patch, and 0 otherwise (see Figure 6). Once the initial visibility mask is calculated, the processor sends it, attached to a copy of the patch, to the neighboring processors which share a common frontier with it.

Figure 6: Computation of the starting visibility mask.
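A visibility mask can be represented as a simple n × n boolean grid (a hypothetical sketch of the data structure; the merge-by-OR rule for combining two masks of the same frontier is our assumption):

```python
class VisibilityMask:
    # n x n grid of booleans over a frontier; bits[i][j] is 1 if
    # frontier patch (i, j) is visible from the shooting patch.
    def __init__(self, n):
        self.n = n
        self.bits = [[0] * n for _ in range(n)]

    def set_visible(self, i, j):
        self.bits[i][j] = 1

    def merge(self, other):
        # Assumed combination rule: a frontier patch is visible if
        # it is visible in either mask (logical OR).
        for i in range(self.n):
            for j in range(self.n):
                self.bits[i][j] |= other.bits[i][j]

    def visible_count(self):
        return sum(map(sum, self.bits))
```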
4.2 Communication stage
Each processor stores the patches received from other processors in a queue of pending patches: a patch stays in that queue while its visibility information is not complete. Once the visibility information for a received patch is complete, the patch is moved from the pending queue to a queue of unshot patches. Besides, this patch can now be sent to the neighboring processors.
In order to know when the visibility information is complete, and to avoid sending a patch several times to a processor, a flag is attached to each patch. This flag has two fields: one contains the information required to identify the processor which sends the patch and the other identifies that particular patch. From this information the processor determines the frontiers from which it is going to receive visibility masks for the patch, and the frontiers through which it is going to send the new masks updated with the local visibility information. Let us denote the coordinates of the subspace/processor containing the patch and the coordinates of this processor as (sx, sy, sz) and (cx, cy, cz), respectively. The analysis to determine whether a frontier is an input or an output frontier for a patch consists of applying the following expressions (for the x coordinate; the y and z cases are analogous):
F(cx + 1) = OUT · (sx ≤ cx) + IN · (sx > cx)
F(cx − 1) = OUT · (sx ≥ cx) + IN · (sx < cx)    (1)
Figure 7: Input and output frontiers for patch p and processors 5 and 8.
Figure 8: Intermediate mask.
where F(cx + 1) and F(cx − 1) are the frontiers between the subspace assigned to the processor under consideration, (cx, cy, cz), and the subspaces (cx + 1, cy, cz) and (cx − 1, cy, cz), respectively; OUT and IN indicate whether it is an output or an input frontier.
To clarify, let us consider the example of Figure 7, where a scene has been partitioned into 12 subspaces. This figure depicts the input frontiers for processors 5 and 8 regarding the patch P in processor 6. Processor 5 must receive P through frontiers 1, 4 and 5, and it has nothing to send. Processor 8 must receive P only once, through frontier 1, and has to send it through frontiers 0 and 2.
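Expression (1) translates directly into code; the sketch below (helper name ours) classifies the two frontiers of a subspace along one axis:

```python
def frontier_roles(s, c):
    # Eq. (1) for one coordinate axis: s is the coordinate of the
    # subspace containing the patch, c this processor's coordinate.
    # Returns the roles of the frontiers F(c+1) and F(c-1).
    f_plus = "OUT" if s <= c else "IN"    # F(c + 1)
    f_minus = "OUT" if s >= c else "IN"   # F(c - 1)
    return f_plus, f_minus
```

A patch originating in a subspace with a smaller coordinate therefore arrives through F(c − 1) and leaves through F(c + 1), so patches always flow away from their source subspace.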
The propagation of a patch to the neighboring subspaces is carried out once the patch and all the visibility information associated with it have been received through the input frontiers. If the subspace has no output frontiers, then the transmission of the patch information finishes. Otherwise, for every output frontier the intermediate mask is computed. The basic idea, outlined in Figure 8, consists of updating the visibility masks by casting rays from the received patch to the output frontier through the input frontiers. In the figure, the calculation of the intermediate visibility mask for the output frontier subspace 1 → subspace 2 regarding the patch A is depicted.
During this stage non-blocking communications are used to overlap computation and communication.
4.3 Convergence test
At the end of each iteration, after the communication stage, convergence is checked. In the first iteration, the patch with the greatest energy in the scene is chosen by means of a reduction operation (MPI_Allreduce); therefore this energy value is obtained in all processors. In the next iterations, each processor compares this value with the radiosity shot in that iteration. If the difference between them is greater than a given threshold and the queue of pending patches becomes empty, the processor enters an idle state; otherwise a new iteration of the progressive radiosity algorithm begins.
In the idle state, processors send messages to one another with the sum of the number of sent and received patches. A processor finishes the iterative process if this value is equal for all processors; otherwise, it remains in the idle state because some patch could still be received.
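The convergence check and the idle-state handshake above can be sketched in plain Python (function and variable names are ours; `initial_max_energy` stands for the value obtained through MPI_Allreduce in the first iteration — this is an illustrative sketch, not the paper's implementation):

```python
# Hypothetical sketch of the per-processor convergence test described above.
# Names are illustrative, not taken from the actual code.

def next_state(initial_max_energy, shot_radiosity, threshold, pending_patches):
    """Decide whether this processor keeps iterating or goes idle."""
    converged_locally = (initial_max_energy - shot_radiosity) > threshold
    if converged_locally and not pending_patches:
        return "idle"      # wait in case patches are still in flight
    return "iterate"       # start a new progressive-radiosity iteration

def all_done(patch_counts):
    """Global termination: every processor reports the same sum of
    sent + received patches, so no patch can still be in flight."""
    return len(set(patch_counts)) == 1
```

The global test compares the sent+received counters of all processors; equality means no patch can still be traveling between subspaces, so the iterative process can safely finish.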
5 Experimental Results
In this section, we present performance results and analysis of the proposed scene partitioning method. We have tested our method on two different clusters of PCs: a cluster with Pentium III 800 MHz processors connected through a 100 Mbps Fast Ethernet network (F. Ethernet Cluster), and a cluster with Intel XEON IV 1.8 GHz processors connected through a 250 Mbps SCI (Scalable Coherent Interface) network (SCI Cluster).
Three different scenes have been rendered using our method. Two of them are shown in Figure 9: room-model (Figure 9.a) and armadillo (Figure 9.b). The third scene, largeroom-model, is roughly the result of joining two room-model scenes. The scenes have 5306, 3999 and 7070 triangles, respectively, in the sequential case. The room-model has usually been used to evaluate illumination methods, and the armadillo was chosen to evaluate the performance of our method for non-uniform scenes.
Figure 9: Illuminated scenes. (a) room-model (b) armadillo.

The results in terms of speedup on both machines (F. Ethernet Cluster and SCI Cluster) are shown in Figure 10. All results have been measured with respect to the straightforward sequential algorithm, and show the low overhead associated with our partitioning method. The execution times for the sequential algorithm on the F. Ethernet Cluster and the SCI Cluster for the largeroom-model were 344.44 seconds and 142.88 seconds, respectively; the corresponding parallel execution times on 8 processors were 15.17 and 10.51 seconds. As these figures show, a super-linear speedup is obtained for these scenes. This is mainly due to two reasons: partitioning the scene greatly improves data locality, with a significant reduction in cache misses and better exploitation of the memory hierarchy in each processor; and the use of visibility masks introduces approximations in the visibility computation, greatly reducing the complexity of visibility determination with little loss of accuracy.
On the other hand, the parallel execution time increases with respect to the sequential time for scenes with a small number of triangles when only a few processors are used (see Figure 11 for 2 processors). The reason for this behavior is the overhead associated with our method: scene partitioning and visibility mask computation. Beyond 4 processors, the effect of this overhead is balanced by the advantages introduced by the parallelization. Note that the speedup obtained on the F. Ethernet cluster is higher than on the SCI cluster, because the cache miss rate in the latter is higher.
In summary, the performance of our method on the two different systems has been very good, even for scenes with different degrees of uniformity.
Figure 10: Speedup of the test scenes (largeroom-model, room-model and armadillo) versus the ideal speedup. (a) F. Ethernet cluster (b) SCI cluster.
Figure 11: Execution time for the armadillo on the F. Ethernet and SCI clusters.
6 Conclusions
Achieving a good environment subdivision plays a very important role in a large number of computer graphics applications. In this work we have presented a parallel partitioning of an input scene, achieving a very low-cost subdivision of the scene into an arbitrary number of subenvironments while avoiding the replication of data in memory. As a demonstration we have applied the algorithm to a parallel implementation of a progressive radiosity method, achieving a very flexible and general implementation, appropriate for heterogeneous systems.
Our parallel progressive radiosity algorithm maintains high quality in the illuminated scenes. Furthermore, our method achieved high speedups on the two different clusters of PCs, exploiting data locality and reducing and optimizing communication.
Acknowledgements
This work was supported in part by the Ministry of Science and Technology (MCYT) of Spain under project TIC 2001-3694-C02 and by the Xunta de Galicia under project PIDT01-PXI10501PR.
References
[1] B. Arnaldi, T. Priol, L. Renambot and X. Pueyo. Visibility Masks for Solving Complex Radiosity Computations on Multiprocessors. Proc. of the First Eurographics Workshop on Parallel Graphics and Visualization, 219–232, 1996.

[2] M. Cohen and J. R. Wallace. Radiosity and Realistic Image Synthesis. Academic Press Professional, 1993.

[3] M. Cohen, S. E. Chen, J. R. Wallace and D. P. Greenberg. A Progressive Refinement Approach to Fast Radiosity Image Generation. ACM Computer Graphics (Proc. of SIGGRAPH '88), 22(4):31–40, 1988.

[4] A. Fujimoto, T. Tanaka and K. Iwata. Arts: Accelerated Ray Tracing. IEEE Computer Graphics and Applications, 6(4):16–26, 1986.

[5] R. Garmann. Spatial Partitioning for Parallel Hierarchical Radiosity on Distributed Memory Architectures. Proc. of the Third Eurographics Workshop on Parallel Graphics and Visualization, 2000.

[6] S. Gibson and R. J. Hubbold. A Perceptually-Driven Parallel Algorithm for Efficient Radiosity Simulation. IEEE Transactions on Visualization and Computer Graphics, 6(3):220–235, 2000.

[7] E. A. Haines and J. R. Wallace. Shaft Culling for Efficient Ray-Traced Radiosity. In Proc. of Photorealistic Rendering in Computer Graphics (Second Eurographics Workshop on Rendering), 122–138, 1994.

[8] P. Hanrahan, D. Salzman and L. Aupperle. A Rapid Hierarchical Radiosity Algorithm. ACM Computer Graphics (Proc. of SIGGRAPH '91), 25(4):197–206, 1991.
[9] Y.-D. Liang and B. Barsky. An Analysis and Algorithm for Polygon Clipping. CACM, 26(11):868–877, 1983.

[10] D. Meneveaux and K. Bouatouch. Synchronization and Load Balancing for Parallel Hierarchical Radiosity of Complex Scenes on a Heterogeneous Computer Network. Computer Graphics Forum, 18(4):201–212, 1999.

[11] E. Padron, M. Amor, J. Tourino and R. Doallo. Hierarchical Radiosity on Multicomputers: a Load-Balanced Approach. Proc. of the Tenth SIAM Conference on Parallel Processing for Scientific Computing, 2001.

[12] E. J. Padron, M. Amor, J. R. Sanjurjo and R. Doallo. Comparative Study of Visibility Algorithms in Global Illumination Methods. Technical report (in Spanish), 2003.

[13] L. Renambot, B. Arnaldi, T. Priol and X. Pueyo. Towards Efficient Parallel Radiosity for DSM-based Parallel Computers Using Virtual Interfaces. Proc. of the Third Parallel Rendering Symposium (PRS '97), 79–86, 1997.

[14] K. Schloegel, G. Karypis and V. Kumar. CRPC Parallel Computing Handbook, chapter Graph Partitioning for High Performance Scientific Simulations. Morgan Kaufmann, 2000.

[15] O. Schmidt and L. Reeker. New Dynamic Load Balancing Strategy for Efficient Data-Parallel Radiosity Calculations. Proc. of Parallel and Distributed Processing Techniques and Applications (PDPTA '99), 1:532–538, 1999.
Efficient Scheduling for Low-Power High-Performance DSP Applications

Zili Shao, Qingfeng Zhuge, Youtao Zhang, Edwin H.-M. Sha
Department of Computer Science
University of Texas at Dallas, Richardson, Texas 75083, USA
{zxs015000, qfzhuge, zhangyt, edsha}@utdallas.edu
Abstract
To process high performance DSP applications in embedded systems, VLIW (Very Long Instruction Word) architectures are widely adopted in high-end DSP processors. While better performance is achieved, more power consumption may be needed in order to execute several instructions simultaneously in a VLIW processor. Therefore, an important problem is how to schedule an application on a VLIW processor such that the energy consumption of the application is minimized while the timing performance is satisfied.
To solve this problem, we identify the two most important factors influencing the energy consumption of an application executed on a VLIW processor: switching activity and schedule length. Considering these two factors together, we present a novel instruction-level energy-minimization scheduling technique to reduce the energy consumption of an application on a VLIW processor. We first formally prove that this problem is NP-complete. Then three heuristic algorithms, MSAS, MLMSA, and EMSA, are proposed and studied. While switching activity and schedule length are given higher priority in MSAS and MLMSA, respectively, EMSA gives the best result considering both of them. The experimental results show that EMSA gives a 31.7% reduction in energy compared with the standard list scheduling approach on average.
1 Introduction
In embedded systems, high performance digital signal processing (DSP) used in image processing, multimedia, wireless security, etc., needs to be processed not only with high data throughput but also with low power consumption. To satisfy the ever-growing performance requirement, a high-performance architecture with multiple functional units, such as VLIW (Very Long Instruction Word) architectures (for example, the TI TMS320C6K), is now commonly used. Since multiple functional units execute simultaneously in these architectures, power consumption becomes one of the most important issues to be considered alongside performance. Therefore, an efficient scheduling scheme is needed to reduce the energy consumption of an application as well as guarantee the required timing performance. Recent research on various processors [1, 2] shows that the instruction sequence of an application plays an important role in its energy consumption. Thus, new research directions in power optimization have begun to address the issues of instruction-level scheduling for reducing energy consumption [3–5]. In this paper, we analyze and design efficient algorithms for various types of the instruction-level energy-minimization scheduling problem, i.e., given a Directed Acyclic Graph (DAG) that models an application, how to schedule it on a VLIW architecture so that the energy consumption is minimized while the required timing performance is satisfied. (This work is partially supported by the TI University Program, NSF EIA-0103709 and Texas ARP 009741-0028-2001, USA.)
A VLIW processor executes a very long instruction word (called a long instruction word) during each clock cycle. A long instruction word is composed of several parallel sub-instructions. The power consumption to execute a long instruction word during a clock cycle, P_total, can be computed by:

$$P_{total} = P_{base} + \sum_{i} P_{basic}(si_i) + \sum_{i} P_{switch}(si^{cur}_i, si^{last}_i) \quad (1)$$

where P_base is the base power needed to support instruction execution, P_basic(si_i) is the basic power to execute a sub-instruction si_i on a functional unit, and P_switch is the switching power caused by switching activities between si_cur (the current sub-instruction) and si_last (the last sub-instruction) executed on the same functional unit (FU). Let S be a schedule for an application and L the schedule length of S. Then the energy E_S for Schedule S can be computed
by

$$E_S = L \cdot P_{base} + \sum P_{basic}(si) + \sum P_{switch}$$

The term Σ P_basic(si) is the summation of the basic power consumptions of all sub-instructions of an application; it does not change with different schedules. L and Σ P_switch, however, do change with different schedules. Therefore, in order to minimize the energy consumption of an application, schedule length and switching activity both need to be considered in scheduling.
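Under this model, the energy of a schedule separates into a length-dependent term, a schedule-independent basic term, and a switching term. A minimal sketch (symbol names are ours, assuming a unit clock period):

```python
def schedule_energy(length, p_base, basic_powers, switch_powers):
    """E_S = L * P_base + sum(P_basic) + sum(P_switch), assuming a unit
    clock period. Only `length` and `switch_powers` depend on the schedule."""
    return length * p_base + sum(basic_powers) + sum(switch_powers)

# Two hypothetical schedules of the same program: the basic term is identical,
# so the comparison is decided by schedule length and switching activity.
e1 = schedule_energy(5, 2.0, [1.0] * 8, [3.0, 4.0])   # shorter, more switching
e2 = schedule_energy(6, 2.0, [1.0] * 8, [1.0, 1.0])   # longer, less switching
```

Here the shorter schedule loses because its extra switching outweighs the saved base-power cycle; with a larger `p_base` the ranking would flip, which is exactly the trade-off the paper targets.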
A great deal of research effort has been put into this field. In [6], compiler techniques are proposed to reduce power variations, hot spots and peak power in scheduling. Though these techniques can efficiently reduce power variations, they do not aim to minimize the total energy consumption. The low-power resource allocation approach [3, 7, 8] attempts to find an allocation for a fixed schedule in such a way that the total switching activity can be reduced. While this approach is effectively applied to resource allocation in high-level synthesis, it may give inferior results for the instruction-level energy-minimization scheduling problem because scheduling and allocation are performed separately. In [5], a two-phase scheduling approach is proposed to optimize transition activity on the instruction bus of a VLIW architecture. Based on an initial schedule, this approach performs low-power resource allocation (called horizontal scheduling) in the first phase and vertically schedules instructions in the second phase to further reduce the switching activity. Since the schedule length is fixed a priori, this approach cannot be directly applied to the instruction-level energy-minimization scheduling problem. In [4], several revised list scheduling techniques are proposed to minimize energy based on instruction-level energy models for specific processors. Using similar energy models, [9] presents several energy-oriented instruction scheduling approaches and compares them with performance-oriented scheduling. However, these techniques are not general enough to be applied to VLIW processors because of the specific architectures they target.
Several techniques [10–12] have been proposed to minimize switching activity only. In [10], an instruction scheduling technique called cold scheduling is proposed to reduce the switching activity on the control path. In [11], an operand-sharing scheduling technique is proposed to schedule operation nodes with the same operands as closely as possible to reduce the switching activity on the functional units. In [12], a scheduling algorithm for optimizing the coefficients of an FIR filter is proposed to minimize the switching activity on the memory data bus and functional units. Since schedule length is not considered, these techniques may give inferior results for energy minimization.
In recent works [13, 14], the power-efficient scheduling problem is formulated as the Traveling Salesman Problem (TSP) and solved by TSP heuristics when there is one FU. However, formulating a problem as TSP does not necessarily mean that the problem is NP-complete. For example, the problem of sequencing jobs that require common resources on a single machine [15] can be transferred to TSP but is still polynomially solvable. In the literature, it was still unknown whether the instruction-level scheduling problem with minimum switching activities (the minimum-switching-activities scheduling problem) is solvable in polynomial time or is NP-complete.
In our work, we formally prove that the minimum-switching-activities scheduling problem is NP-complete regardless of whether there are resource constraints. Based on this result, we further prove that the instruction-level energy-minimization problem is NP-complete with or without resource constraints. While the minimum-latency scheduling problem is polynomial-time solvable if there is only one FU or there are no resource constraints, the problem becomes NP-complete when switching activity is considered as a second constraint.
To solve the instruction-level energy-minimization problem on VLIW architectures, three algorithms are proposed. We first propose two algorithms, MSAS and MLMSA, to solve two special cases of the energy-minimization problem. The MSAS algorithm is designed for the case when switching activity plays the most important role in total energy consumption; MLMSA is for the case when schedule length plays the most important role. In both MSAS and MLMSA, we consider all nodes in the ready list in each scheduling step and use weighted bipartite matching to do scheduling and allocation simultaneously. Then an algorithm, the Energy Minimization Scheduling Algorithm (EMSA), is proposed to solve the general instruction-level energy-minimization scheduling problem. In EMSA, we start from an initial schedule and reschedule the nodes to reduce switching activity while relaxing the initial schedule toward a given timing constraint; the best schedule is selected from all schedules generated up to that point. The experimental results show significant reductions in energy consumption. When switching activity plays the most important role in total energy consumption, MSAS gives a reduction of 41.2% in switching activity over standard list scheduling on average. When schedule length plays the most important role, MLMSA keeps the same schedule length as list scheduling and reduces switching activity by 33.2% on average. For the general case, EMSA gives a 31.7% reduction in energy compared with list scheduling on average.
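For reference, the standard list scheduling used as the baseline in these comparisons can be sketched as follows (a generic resource-constrained variant with a trivial priority function; the data structures are our own simplification):

```python
def list_schedule(nodes, preds, num_fus):
    """Greedy list scheduling: at each step, fill up to `num_fus` slots with
    ready nodes (nodes whose predecessors were all scheduled in earlier
    steps). Returns the schedule as a list of steps; len() of the result
    is the schedule length."""
    scheduled = set()
    schedule = []
    while len(scheduled) < len(nodes):
        ready = [n for n in nodes
                 if n not in scheduled and preds.get(n, set()) <= scheduled]
        step = ready[:num_fus]  # a real list scheduler would rank `ready` by priority
        schedule.append(step)
        scheduled.update(step)
    return schedule

# Diamond DAG a -> {b, c} -> d on 2 functional units:
sched = list_schedule(["a", "b", "c", "d"],
                      {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}, 2)
```

MSAS and MLMSA replace the simple slot-filling step with a weighted bipartite matching between ready nodes and functional units, so placement (and hence switching cost) is decided together with the schedule.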
The remainder of the paper is organized as follows. Section 2 introduces basic concepts and the energy model. Section 3 proves that the instruction-level energy-minimization scheduling problem is NP-complete. The algorithms are discussed in Section 4. Experimental results and concluding remarks are provided in Sections 5 and 6, respectively.
2 Basic Concepts and Models
In this section, we introduce the basic concepts that will be used in later sections. A motivational example and the energy model are also introduced.
2.1 DAG and Bit_String(u)

A Directed Acyclic Graph (DAG), G = ⟨V, E⟩, is a node-weighted graph, where V is the set of nodes, each node representing a sub-instruction, and E is the edge set, in which an edge between two nodes denotes a dependency relation. In this paper, the operation time of each node in the DAG is assumed to be one time unit.

Assembly code generated by a compiler has to satisfy the dependency relations in the DAG. It consists of long instruction words, and a long instruction word contains multiple sub-instructions. Each sub-instruction specifies on which functional unit it is going to be executed. The bit switches between consecutive sub-instructions executed on the same functional unit cause the switching power consumption (P_switch in equation (1)) in the power model introduced in Section 1 (a detailed discussion of the power model is given in Section 2.2). Therefore, we can capture the switching power consumption (P_switch) through the computation of switching activities.

We define a function, Bit_String(u), to count the number of bit switches for each node u in the DAG G. Bit_String(u) is a binary string that denotes the state of the signals after u is executed. Without loss of generality, it contains one or more of the following items:
• opcodes. When loading instructions from program memory, the state of the instruction bus may be changed, which causes switching power consumption. Based on a series of experiments we performed on a TMS320VC5410 DSP processor, we collected the core current when executing loops of different instructions and instruction combinations, using a method similar to [1]. The experimental results show that the core current increases correspondingly with the Hamming distance of consecutive instructions. Some experimental results are shown in Figure 1. This confirms the correctness of our power model.
• addresses. To access operands, the state of the data address bus may be changed, causing switching power consumption.
• operands. When the inputs are loaded from data memory and processed on functional units, the states of the data bus, the control paths in the functional units, etc., may be changed and cause switching power consumption. Since applications in embedded systems are very specific, we can predict the possible values or patterns of the inputs a priori. Note that an operand here is a possible state, not a real input.
• Don't Care. Don't Care means keeping the previous state. For example, if we execute an add instruction whose inputs come from two registers and whose output is put into a register, then the states of the data bus and data address bus are not affected. Thus, these states should be kept.
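The bookkeeping implied by the items above — in particular the Don't Care convention of keeping the previous bus state — can be sketched as follows (encoding Don't Care as None is our own choice, not the paper's notation):

```python
def apply_bit_string(prev_state, bit_string):
    """Apply a node's Bit_String to the previous bus state.
    Each element is '0', '1', or None (Don't Care = keep previous state).
    Returns (new_state, number_of_transitions)."""
    new_state = []
    transitions = 0
    for old, new in zip(prev_state, bit_string):
        if new is None:            # Don't Care: previous state is kept
            new_state.append(old)
        else:
            if new != old:
                transitions += 1   # only actual bit flips cost switching power
            new_state.append(new)
    return "".join(new_state), transitions

# A register-only add leaves data/address bus bits untouched (None positions):
state, t = apply_bit_string("0110", ["1", None, "0", None])
```

Summing `t` over all consecutive sub-instructions on each functional unit yields the switching activity that P_switch charges for.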
Figure 1. Some experimental results for the TI TMS320C5410 DSP processor: core currents between 42.5 mA and 48.2 mA measured for loops of different instruction combinations (MPY/ADD/SUB sequences).
A motivational example is shown in Figures 2–4. To schedule a Finite Impulse Response (FIR) filter on an experimental VLIW processor, we show how Bit_String() can be used to denote the state of the signals and how different schedules can cause different bit switches. The architecture of our experimental VLIW processor is shown in Figure 2(a). It is a simplified model, similar to one cluster of the TI C6000 VLIW processors. It uses a Harvard architecture with separate memory spaces for program and data. For illustration purposes, we assume that the program memory has a 30-bit instruction bus and the data memory has a 4-bit address bus and data bus. Thus, in each clock cycle, a 30-bit long instruction word composed of 3 sub-instructions is loaded into the decoder, and the sub-instructions are redirected to the corresponding functional units to be executed after decoding. There are 3 functional units (.L, .A and .M in Figure 2(a)): one can perform load/store or addition operations, another multiplication or addition operations, and the third only addition operations. According to the specifications of the functional units, a long instruction word has three possible combinations. One is to contain one load/store instruction, one multiplication instruction, and one addition instruction. Another is to contain two addition instructions and one load/store or multiplication instruction. The third is to contain 3 addition instructions. The instructions and the corresponding opcodes are shown in Figure 2(b). We assume the operation time of each instruction is one time unit. Direct addressing is used by load/store instructions. Multiplication (mpy) and addition (add) operations are performed on the general purpose registers A0–A7. Their inputs come from two registers and the result is stored into the second register. For example, mpy A0, A7 will store the result into A7.

A FIR filter is shown in Figure 3(a). Figure 3(b) shows the addresses and values of its coefficients (b0–b3) and its first 4 inputs (x[0]–x[3]). The assembly code, the corresponding opcodes and node names are shown in Figure 3(c), in which the coefficients and inputs are first loaded into A0–A7 and then computed. The DAG representation of the assembly program is shown in Figure 3(d). Since bus capacitances are usually several orders of magnitude higher than those of the internal nodes of a circuit [16], a considerable amount of power can be saved by reducing the switching activity on the buses. Therefore, we will only consider reducing the transitions on the instruction bus, data address bus and data bus in this example. For each node u, Bit_String(u) denotes the states of the three buses after executing node u. It consists of three parts: the opcode on the instruction bus, the operand on the data bus, and the access address on the data address bus, as shown in Figure 4(a). Usually we do not know the inputs at compile time. However, we can predict their possible values or patterns for a specific application. In this example, we assume our prediction of the operand part in Bit_String(u) is totally correct. Since multiplication, addition and nop operations cannot influence the state of the data memory, we put "—" under the columns "data addr." and "data bus" to denote the "Don't Care" state. Two different schedules are shown in Figures 4(b) and (c), respectively. Schedule 1 is obtained by traditional list scheduling and Schedule 2 is obtained by our MSAS algorithm. Assume the initial states of the buses are all 0's. Then the total number of bit switches over all buses is the total number of transitions for a schedule. The number of transitions for Schedule 1 is 121 and that for Schedule 2 is 66. Since they have the same schedule length, Schedule 2 consumes less energy than Schedule 1 when executed on our experimental VLIW processor. The generated code with long instruction words can be found in the instruction bus columns of Figures 4(b) and (c).
2.2 Power Cost Function and Energy Model
Our energy model is based on the methodology of the instruction-level energy estimation framework for VLIW architectures proposed in [17–20]. In VLIW processors such as the TI TMS320C620x and TMS320C670x DSP devices, there is no data cache and application programs are loaded into on-chip memory before they are executed [21]; therefore the influence of cache misses is not considered in our energy model.
As shown in equation (1), the power consumption to execute a long instruction word during a clock cycle, P_total, can be computed by:

$$P_{total} = P_{base} + \sum_{i} P_{basic}(si_i) + \sum_{i} P_{switch}(si^{cur}_i, si^{last}_i)$$

In VLIW processors, P_base denotes the power to support basic pipeline execution even when a long instruction word contains only NOPs. P_basic(si_i) denotes the basic power for a functional unit to execute a sub-instruction on a specific datapath even when the inputs do not cause any transitions. P_switch denotes the switching power caused by switching activities between si_cur (the current sub-instruction) and si_last (the last sub-instruction) executed on the same FU.
The switching power is proportional to the number of transitions. So

$$P_{switch} = P_{tr} \cdot WHD(Bit\_String(si^{cur}), Bit\_String(si^{last})) \quad (2)$$

where P_tr is a power coefficient representing the consumed power per transition, and WHD (Weighted Hamming Distance) is a function used to compute the number of transitions between si_cur and si_last. Let a = Bit_String(si_cur) and b = Bit_String(si_last); we define WHD(a, b) as:

$$WHD(a, b) = \sum_{i} w_i \, (a_i \oplus b_i) \quad (3)$$

where w_i is the weight of a transition; w_i denotes the weight of the power consumption caused by one transition on different units.
The energy consumption of an application is the summation of its power consumption over all clock cycles. Let S be a schedule for an application and L the schedule length of S. Then the energy consumption of Schedule S, E_S, can be computed by

$$E_S = L \cdot P_{base} + \sum P_{basic}(si) + \sum P_{switch}$$

The term Σ P_basic(si) is the summation of the basic power consumptions of all sub-instructions of an application and does not change with different schedules. P_base is a constant for a specific VLIW processor, so L · P_base varies with the schedule length. Σ P_switch is the switching power and changes with different schedules. Therefore, schedule length and switching activity need to be considered together in instruction-level scheduling techniques in order to minimize the energy consumption of an application. One example is shown in Figure 5.
In Figure 5(a), a DAG is given, in which the binary string next to each node is the value of Bit_String(u). Given 2 functional units, FU1 and FU2, three different schedules are shown in Figures 5(b)-(d). An empty slot denotes a NOP node. We assume a NOP node does not cause any transition. Schedule 1 and Schedule 2 have the same schedule length but Schedule 2 has fewer bit switches; thus we can easily see that Schedule 2 has lower energy consumption than Schedule 1. To compare the energy consumptions of Schedule 2 and Schedule 3, we need to consider the schedule length. Assume that the power coefficient P_tr in equation (2) is 1 and w_i in equation (3) is 1, and ignore the term Σ P_basic common to all schedules. With P_base = 1, the energy for Schedule 3 is 6 · 1 + 4 = 10 and that for Schedule 2 is 5 · 1 + 8 = 13, so Schedule 3 uses less energy than Schedule 2. However, when P_base = 5, the energy for Schedule 3 is
Figure 2. (a) A VLIW processor (program controller, decoder, register file A0–A7, and functional units .L, .A and .M; a 30-bit long instruction word holds three sub-instructions). (b) The instruction set, with 10-bit opcodes for ld, mpy, add .M, add .A, add .L and nop.
Figure 3. (a) A FIR filter: Y[n] = b0·X[n] + b1·X[n−1] + b2·X[n−2] + b3·X[n−3] (b) The coefficients and inputs in data memory (c) The assembly program, opcodes and node ids (d) The DAG representation
Figure 4. (a) Bit_String() (b) Schedule 1 (c) Schedule 2
Figure 5. (a) A given DAG (b) Schedule 1 with total bit switches 15 (c) Schedule 2 with total bit switches 8 (d) Schedule 3 with total bit switches 4
6 · 5 + 4 = 34 (schedule length 6, 4 bit switches) and that for Schedule 2 is 5 · 5 + 8 = 33. Then Schedule 2 uses less energy. Thus, we have to consider both schedule length and switching activity when comparing the energy consumption of schedules.
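Dropping the Σ P_basic term common to both schedules and taking P_tr = w_i = 1, the comparison reduces to E = L · P_base + (bit switches). A quick numeric check, assuming schedule lengths of 5 and 6 and the switch counts 8 and 4 from Figure 5:

```python
def energy(length, switches, p_base):
    """E = L * P_base + total bit switches, with P_tr = w_i = 1 and the
    schedule-independent sum of basic powers omitted."""
    return length * p_base + switches

# Schedule 2: length 5, 8 bit switches; Schedule 3: length 6, 4 bit switches.
e2_low, e3_low = energy(5, 8, 1), energy(6, 4, 1)     # Schedule 3 wins
e2_high, e3_high = energy(5, 8, 5), energy(6, 4, 5)   # Schedule 2 wins
```

The crossover as P_base grows is what makes neither "shortest schedule" nor "fewest switches" correct on its own.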
2.3 Weighted Bipartite Matching
In our algorithms, we use weighted bipartite matching to help find the binding between nodes in a DAG and resources. A Weighted Bipartite Graph G = ⟨V, E, W⟩ is an edge-weighted undirected graph, where V can be partitioned into two sets V1 and V2 such that each edge (u, v) ∈ E implies either u ∈ V1 and v ∈ V2 or u ∈ V2 and v ∈ V1, and W(e) represents the weight of edge e ∈ E.

Given a bipartite graph G = ⟨V, E, W⟩, a matching M for G is a subset of E such that for each node v ∈ V, at most one edge of M is incident on v. The weight of a matching is the summation of the weights of all edges in M. A maximum matching is a matching of maximum cardinality, i.e., a matching M such that for any matching M′ we have |M| ≥ |M′|, where |M| denotes the cardinality of M. A min-cost maximum matching is a maximum matching with minimum weight.
2.4 The Optimization Problem
Our optimization problem is defined as follows: given a DAG G = ⟨V, E⟩, a function Bit_String(u) for each node u ∈ V, and K functional units, find a schedule S of G that has minimum energy consumption based on the energy model in Section 2.2.
3 NP-Completeness
In this section, we prove that our optimization problem is NP-complete.
From the energy model in equation (2), we know the energy consumption of a schedule is related to schedule length and switching activity. To prove our optimization problem NP-complete, we use the following method. We first consider two subproblems of our optimization problem: the minimum-latency scheduling problem, which only minimizes schedule length, and the minimum-switching-activities scheduling problem, which only minimizes switching activity. If we can prove one of these two problems NP-complete, we can further prove that our optimization problem is NP-complete. It is well known that the minimum-latency scheduling problem is polynomial-time solvable when there is only one functional unit or there are no resource constraints. Therefore, the focus is put on the minimum-switching-activities scheduling problem. As discussed in Section 1, it was still unknown in the literature whether the minimum-switching-activities scheduling problem is solvable in polynomial time or is NP-complete. Thus, we categorize the problem into the two cases shown in Theorem 3.1 and Theorem 3.2 and prove them as follows.
Theorem 3.1. Let K be the number of resources. When K = 1, the minimum-switching-activities scheduling problem is NP-complete.
In order to prove Theorem 3.1, we first define two decision problems: one is the decision problem (DP1) of the minimum-switching-activities scheduling problem when K = 1, and the other is the rectilinear Geometric Traveling Salesman Problem (GTSP). We reduce GTSP to DP1 in the proof.

DP1: Given a DAG G = (V, E), a function Bit String(v) for each node v in V, one resource, and one constant C, does there exist a schedule that has switching activities at most C?

The rectilinear geometric traveling salesman problem (GTSP) [22]: Given a set S of integer-coordinate points in the plane and a constant B, does there exist a circuit passing through all the points of S which, with edge length measured by the rectilinear metric, has total length less than or equal to B?
Proof. It is obvious that DP1 belongs to NP. Assume that S = {(x1, y1), (x2, y2), ..., (xn, yn)} is an instance of GTSP. Construct a DAG G = (V, E) as follows: V = {v1, v2, ..., vn}, where vi corresponds to the point (xi, yi) in S, and E is empty. Assume that X = max(xi) and Y = max(yi) for 1 <= i <= n; then Bit String(vi) = 0^(X-xi) 1^(xi) . 0^(Y-yi) 1^(yi) for each vi (1 <= i <= n), where "." denotes concatenation. For example, if X = 3, Y = 3, and vi = (2, 1), then Bit String(vi) = 011 001. Under this encoding, the Hamming distance between two bit strings equals the rectilinear distance between the corresponding points. Set C = B. Since GTSP is NP-complete and the reduction can be done in polynomial time, DP1 is NP-complete.
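The reduction hinges on encoding each point so that the Hamming distance between two bit strings equals the rectilinear (L1) distance between the corresponding points. The following sketch uses a unary-style encoding; this is our reconstruction of the partially garbled construction, chosen because it reproduces the paper's example of the point (2, 1) mapping to 011 001 when X = Y = 3.

```python
def bit_string(x, y, X, Y):
    """Unary-style encoding of an integer point (x, y) with bounds
    X >= x and Y >= y: zeros padding followed by x ones, then zeros
    padding followed by y ones."""
    return "0" * (X - x) + "1" * x + "0" * (Y - y) + "1" * y

def hamming(a, b):
    """Number of differing bit positions between equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

p, q = (2, 1), (3, 3)
d_ham = hamming(bit_string(*p, 3, 3), bit_string(*q, 3, 3))
d_l1 = abs(p[0] - q[0]) + abs(p[1] - q[1])
# d_ham == d_l1 == 3
```

Because each coordinate is encoded in unary, two encodings of the same coordinate axis differ in exactly |x1 - x2| positions, so the total Hamming distance is the rectilinear distance.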
Theorem 3.2. Let K be the number of resources. When K > 1, the minimum-switching-activities scheduling problem is NP-complete.

The decision problem (DP2) of the minimum-switching-activities scheduling problem when K > 1 is similar to DP1 except that the number of resources is greater than 1 in DP2. The proof of Theorem 3.2 is as follows.
Proof. It is obvious that DP2 belongs to NP. Assume that S = {(x1, y1), (x2, y2), ..., (xn, yn)} is an instance of GTSP. Construct a DAG G = (V, E) as follows: V = {v1, v2, ..., vn}, where vi corresponds to the point (xi, yi) in S, and E is empty. Assume that X = max(xi) and Y = max(yi) for 1 <= i <= n; then Bit String(vi) = 1^L . 0^(X-xi) 1^(xi) . 0^(Y-yi) 1^(yi) for each vi (1 <= i <= n), where the prefix 1^L is a sufficiently long block of ones. For example, if X = 3, Y = 3, and vi = (2, 1), then Bit String(vi) = 11111111 011 001. Set the bound C as in the proof of Theorem 3.1, adjusted for the cost of the first transition from the all-0 state, and set the initial Bit String of each resource to all 0's.

The prefix of ones in the construction makes all nodes in V be assigned to the same resource in any schedule that minimizes switching activities, since moving to another resource whose state is still all 0's would incur at least L extra bit switches. Since the reduction can be done in polynomial time, DP2 is NP-complete.
From Theorems 3.1 and 3.2, we know that the minimum-switching-activities scheduling problem is NP-complete. We then use this result to prove that our optimization problem is NP-complete.
Theorem 3.3. Our optimization problem for instruction-level energy-minimization scheduling is NP-complete.
Proof. Given an instance of the minimum-switching-activities scheduling problem, it can be directly mapped to an instance of our optimization problem by setting the schedule-length coefficient in the energy model to 0. Since the reduction can be done in polynomial time, our optimization problem is NP-complete.
4 The Algorithms
In this section, we present three algorithms to solve the instruction-level energy-minimization scheduling problem. Two algorithms for two special cases are presented first, and then another algorithm to solve the general problem is proposed.
4.1 MSAS Algorithm
4.1.1 The Algorithm
Our first algorithm, Minimizing Switching Activities Scheduling (MSAS), is designed to solve a special case of the instruction-level energy-minimization scheduling problem, i.e., the case in which the switching activities play the most important role in energy consumption.

When the schedule-length coefficient is very small compared with the switching-energy coefficient (in equation 2), the energy of a schedule depends mainly on switching activities. For example, when the coefficient equals 0.1 in Figure 5, we would need to reduce the schedule length by 10 control steps to offset a single bit switch. Thus, we need an algorithm that minimizes switching activities as much as possible. On the other hand, considering performance, we also want to minimize the schedule length. Hence, the MSAS algorithm minimizes switching activities as its first priority while still considering schedule length. Since most previous works focus on a single functional unit, we need algorithms that can take advantage of multiple functional units under VLIW architectures. The MSAS algorithm is shown in Algorithm 4.1.
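The trade-off can be illustrated with a toy linear energy model of the form E = c_len * SL + c_sw * SA. The exact form and coefficients of equation 2 are not reproduced here, so treat this as a sketch rather than the paper's model; the example schedule numbers are of the magnitude seen in Table 1.

```python
def energy(sl, sa, c_len, c_sw=1.0):
    """Toy linear energy model: schedule length SL weighted by c_len,
    switching activity SA weighted by c_sw (a sketch of equation 2)."""
    return c_len * sl + c_sw * sa

sched_a = (7, 54)   # (schedule length, bit switches), e.g. list scheduling
sched_b = (8, 27)   # one step longer, half the switches, e.g. MSAS

# When switching dominates (c_len = 0.1), the low-switching schedule wins:
assert energy(*sched_b, 0.1) < energy(*sched_a, 0.1)
# When length dominates (c_len = 100), the shorter schedule wins:
assert energy(*sched_a, 100) < energy(*sched_b, 100)
```

This is why the paper needs different algorithms for the two regimes: which schedule is better depends entirely on the coefficient ratio.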
Algorithm 4.1 MSAS (Min-Switching-Activities Scheduling)

Input: DAG G = (V, E), Bit String(v) for each v in V, functional unit set FU = {FU_1, ..., FU_K}
Output: A schedule with switching activities minimized

ReadyList <- all ready nodes in V;
while ReadyList is not empty do
    Construct BG = (V1 ∪ V2, E', W), where:
        V1 = FU and V2 = ReadyList;
        E' = {(FU_j, v) | FU_j in FU, v in ReadyList};
        W(FU_j, v) = f_W(BitState(v), BitState(FU_j));
    M <- MinCostMaxBipartiteMatching(BG);
    for all (FU_j, v) in M do
        Schedule node v to functional unit FU_j;
        BitState(FU_j) <- BitState(v);
        Remove v from ReadyList;
        Check all adjacent nodes of v and put new ready nodes into ReadyList;
    end for
end while
Due to the dependencies in a DAG, we can only schedule a node after all its parent nodes have been scheduled. The scheduling problem with switching activities minimization is how to find a matching between functional units and ready nodes such that the schedule based on this matching minimizes the total switching activities in every scheduling step. This is equivalent to the min-cost weighted bipartite matching problem. Thus, in the MSAS algorithm, we repeatedly create a weighted bipartite graph BG between the set of functional units and the set of nodes in the Ready List, and assign nodes based on the min-cost maximum bipartite matching M. In each scheduling step, the weighted bipartite graph BG = (V1 ∪ V2, E', W) is constructed as follows: V1 = FU and V2 = ReadyList, where FU = {FU_1, ..., FU_K} is the set of all functional units and ReadyList is the set of all nodes in the Ready List; for each FU_j in FU and each node v in ReadyList, an edge (FU_j, v) is added into E' and W(FU_j, v) = f_W(BitState(v), BitState(FU_j)), where f_W(., .) is the weighted Hamming distance function defined in equation 3 and BitState(FU_j) = BitState(u), where u is the last node executed on FU_j.

Since bit switches are set as the weight of each edge, a schedule based on this matching optimally minimizes the switching activities at every scheduling step. All nodes in M are scheduled at every scheduling step, so the schedule length cannot grow too large. After scheduling a node v to a FU_j, we update the state of signals on FU_j by setting BitState(FU_j) <- BitState(v). Note that there may exist "Don't Care" bits in BitState(v). In such a case, we need to substitute the current state for the "Don't Care" bits in order to correctly count bit switches on FU_j. An example is shown in Figure 6.
Given the DAG in Figure 5(a), the scheduling of the first step by the MSAS algorithm when there are 2 FUs is shown in Figure 6. The Ready List of the DAG in the first step is shown in Figure 6(a). Assume that the initial states of the FUs are 000 and that the weights in the weighted Hamming distance function f_W are 1. A weighted bipartite graph based on the set of FUs and the set of nodes in the Ready List is constructed in Figure 6(b). A min-cost maximum bipartite matching is shown in Figure 6(c). The schedule of the first step is shown in Figure 6(d). After scheduling Node 1 to FU1 and Node 2 to FU2, we change the states of the FUs. The final schedule generated by MSAS is shown in Figure 5(c). Compared with Schedule 1 in Figure 5(b), generated by traditional list scheduling, Schedule 2 has fewer bit switches.
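A first scheduling step like the one in Figure 6 can be sketched as follows. This simplified version assumes exactly as many ready nodes as FUs and uses plain (unweighted) Hamming distance with a brute-force matching in place of the Hungarian Method:

```python
from itertools import permutations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def msas_step(fu_states, ready_bits):
    """One MSAS scheduling step (a simplified sketch): assign the ready
    nodes to functional units so that the total Hamming distance between
    each FU's current bit state and its assigned node's bit string is
    minimal. Returns the matching, its total bit switches, and the
    updated FU states."""
    k = len(fu_states)
    switches, assign = min(
        (sum(hamming(fu_states[j], ready_bits[p[j]]) for j in range(k)), p)
        for p in permutations(range(k))
    )
    matching = [(j, assign[j]) for j in range(k)]
    # Each FU's state becomes the bit string of the node it executes.
    new_states = [ready_bits[assign[j]] for j in range(k)]
    return matching, switches, new_states

# Figure 6-style first step: two FUs initialized to "000",
# ready nodes with bit strings "001" (node 1) and "010" (node 2).
matching, switches, states = msas_step(["000", "000"], ["001", "010"])
# Each assignment flips one bit, so the matching weight is 2.
```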
It is known that finding a min-cost maximum bipartite matching on a graph with n nodes takes O(n^3) time by the Hungarian Method [23]. Let K be the number of functional units. In every scheduling step, we need at most O((|V| + K)^3) time to find a min-cost maximum bipartite matching using the Hungarian Method, and there are at most |V| scheduling steps. Thus, the complexity of MSAS is O(|V| (|V| + K)^3).

4.1.2 Applying the MSAS Algorithm to VLIW Architectures
Our generic algorithm can be easily modified to solve the scheduling problem on various VLIW architectures. In this section, we show how to apply MSAS to a VLIW architecture similar to the TI C6000 VLIW processor. In such an architecture, a sub-instruction can be put into any location in the instruction bus; it is the decoder that distributes each sub-instruction to a FU. So we need to change our generic algorithm a little to handle this case. The VLIW architecture consists of heterogeneous FUs with resource constraints. Thus, the number of sub-instructions of a given type executing in a clock cycle cannot exceed the resource bound. For example, since there is only one load/store unit in the experimental VLIW processor in Figure 2, we cannot schedule more than one load/store instruction in the same schedule step. Assume that at most R sub-instructions of a type can be executed in a clock cycle. If there are more than R sub-instructions of this type in the bipartite matching in a schedule step of MSAS, we sort the edges carrying this type of sub-instruction in increasing order of weight and schedule the sub-instructions of the first R edges. Then we remove all sub-instructions of this type from the Ready List and find the best matching for the remaining FUs and remaining nodes, repeating until every FU has been assigned a node. In this way, we reduce switching activity as much as possible. We also need to add NOP nodes if the number of nodes is less than the number of FUs in a step. Using S_i to denote sub-instruction i in the instruction bus and applying the above, Schedule 2 in Figure 4(c) is generated by MSAS. It reduces the bit switches from 121 in Schedule 1 to 66.
4.2 MLMSA Algorithm
Our second algorithm, Minimizing Latency with Minimal Switching Activities Scheduling (MLMSA), is designed to solve another special case of the instruction-level energy-minimization scheduling problem, i.e., the case in which the schedule-length coefficient in equation 2 is very large compared with the switching energy caused per transition. In this case, the schedule length plays the most important role in energy consumption. Thus, MLMSA is designed to generate a schedule with minimum schedule length as its first priority and then reduce switching activities as much as possible.

MLMSA is similar to MSAS except that we use a different method to assign weights to the edges of the weighted bipartite graph. In order to reduce the schedule length, we define Depth(v), a priority function for each node v in a DAG, to be the length of the longest path from v to a leaf node. When constructing the weighted bipartite graph BG, the weight of each edge (FU_j, v) is set as W(FU_j, v) = f_W(BitState(v), BitState(FU_j)) - Depth(v) * C_max * BLen, where Depth(v) is the priority value of node v as mentioned above, BLen is the bit length of Bit String(v), and C_max is the maximum of the weights c_i in equation 3.

Using this method, the priority of a node dominates the edge weight in the bipartite graph BG. Thus, the node with the highest priority in the Ready List will be scheduled first. However, among nodes with the same priority, switching-activity minimization is considered. Thus, we minimize the schedule length while still reducing switching activities.
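The effect of the priority term can be sketched as follows. The weight formula here is our reconstruction of the garbled original: `depth` and `c_max` stand in for Depth(v) and the maximum weight coefficient of equation 3, and the point is only that the priority term dominates the Hamming term.

```python
def depth(node, children):
    """Longest path from node to a leaf in the DAG (the MLMSA priority)."""
    kids = children.get(node, [])
    return 0 if not kids else 1 + max(depth(k, children) for k in kids)

def mlmsa_weight(fu_state, node_bits, node_depth, c_max):
    """Edge-weight sketch: subtracting depth * c_max * bit-length makes
    node priority dominate in a min-cost matching; Hamming distance only
    breaks ties among nodes of equal depth."""
    ham = sum(a != b for a, b in zip(fu_state, node_bits))
    return ham - node_depth * c_max * len(node_bits)

children = {"a": ["b"], "b": ["c"], "c": []}   # chain a -> b -> c
w_deep = mlmsa_weight("000", "111", depth("a", children), c_max=1)
w_shallow = mlmsa_weight("000", "000", depth("c", children), c_max=1)
# The deeper node gets the smaller (more attractive) weight even though
# it costs 3 bit switches while the shallow node costs 0:
assert w_deep < w_shallow
```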
4.3 EMSA Algorithm
In this section, an algorithm, the Energy Minimization Scheduling Algorithm (EMSA), is proposed to solve the general instruction-level energy-minimization scheduling problem. The basic idea is to reschedule nodes to reduce switching activities while relaxing the schedule length of an initial schedule up to a given timing constraint, and then select
[Figure 6 residue: (a) the Ready List of nodes 1-4 with their bit strings, e.g. node 1 = (001) and node 2 = (010); (b) the weighted bipartite graph between FU1, FU2 (initial state 000) and the ready nodes; (c) the min-cost maximum matching M = {(FU1, 1), (FU2, 2)} with weight 2; (d) the first-step schedule FU1 <- 1 (001), FU2 <- 2 (010).]

Figure 6. (a) Ready List (b) Weighted Bipartite Graph (c) Min-cost Max Bipartite Matching (d) The schedule for the first step
the schedule with minimal energy consumption. EMSA is shown in Algorithm 4.2.
Algorithm 4.2 EMSA (Energy Minimization Scheduling Algorithm)

Input: DAG G = (V, E), Bit String(v) for each v in V, functional unit set FU = {FU_1, ..., FU_K}, an initial schedule S_init, a timing constraint TC
Output: A schedule with minimum energy

L <- the schedule length of S_init; i <- 1;
for t <- L to TC do
    S_i <- Relax_Reschedule(G, BitState, FU, S_init, t);
    i <- i + 1;
end for
Select the schedule with minimum energy among S_i (1 <= i <= TC - L + 1);
In EMSA, a function, Relax_Reschedule, is called to generate a new schedule for each t from L to TC, where L is the schedule length of the given initial schedule and TC is the timing constraint. Each schedule generated by Relax_Reschedule has low switching activities because the switching activities are reduced as much as possible inside Relax_Reschedule. For each schedule, we compute its energy consumption based on equation 2. Then the schedule with minimum energy consumption is selected as the output.

The function Relax_Reschedule reschedules nodes to reduce switching activities while relaxing the initial schedule, as shown in Algorithm 4.3. Relax_Reschedule works much like As-Late-As-Possible (ALAP) scheduling. In Relax_Reschedule, we reschedule nodes to new locations from the bottom up, based on an initial schedule and a timing constraint. Since the timing constraint is equal to or greater than the schedule length of the initial schedule, the nodes in the initial schedule may have some freedom to be moved to a new location such
Algorithm 4.3 Relax_Reschedule()

Input: DAG G = (V, E), Bit String(v) for each v in V, functional unit set FU, an initial schedule S_init, a timing constraint TC
Output: A schedule with switching activities minimized

for all v in V do
    LatestStep(v) <- TC;
    RestChild(v) <- the number of child nodes of v;
end for
LeafList <- all nodes with RestChild = 0 in V;
while LeafList is not empty do
    for all v in LeafList do
        CurStep(v) <- current schedule step of v in S_init;
        LocSet(v) <- all available locations that v can be put into, from CurStep(v) to LatestStep(v);
        for all Loc in LocSet(v) do
            SW(Loc) <- the switching activities caused if v is moved to Loc;
        end for
        MinSW(v) <- the minimum value among all SW(Loc);
        BestLoc(v) <- the Loc closest to the bottom with SW(Loc) = MinSW(v);
    end for
    u <- the node with minimum MinSW(v) among all nodes in LeafList;
    Reschedule u to BestLoc(u);
    for all parent nodes w of u in G do
        SchedStep(u) <- the schedule step of u;
        if LatestStep(w) > SchedStep(u) - 1 then
            LatestStep(w) <- SchedStep(u) - 1;
        end if
        RestChild(w) <- RestChild(w) - 1;
    end for
    Remove u from LeafList and put new ready nodes with RestChild = 0 into LeafList;
end while
that the switching activities can be reduced. In order to obey the precedence relations, we associate two variables, LatestStep(v) and RestChild(v), with each node v, where LatestStep(v) keeps track of the latest schedule step to which v may be moved, and RestChild(v) keeps track of the number of child nodes of v that have not yet been rescheduled. We put all nodes with RestChild = 0 into the set LeafList and compute the best location of each node with two considerations: 1) it causes the greatest reduction in switching activity; 2) it is the closest to the bottom. Then we select the node with the greatest reduction in switching activity among all nodes in LeafList and reschedule it to its best location. In this way, we reduce switching activities as much as possible when rescheduling each node.
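The inner step of Relax_Reschedule, evaluating candidate locations for one node and preferring the bottom-most location among those with minimal switching activity, can be sketched for a single functional unit as follows. This is a simplification that ignores precedence constraints and multiple FUs.

```python
def switching(seq):
    """Total bit switches of a single FU executing seq in order."""
    return sum(sum(a != b for a, b in zip(x, y))
               for x, y in zip(seq, seq[1:]))

def move(seq, i, j):
    """Return a copy of seq with element i moved so it ends up at index j."""
    s = list(seq)
    v = s.pop(i)
    s.insert(j, v)
    return s

def best_location(schedule, idx, latest):
    """Among positions idx..latest, find where schedule[idx] should go:
    minimal switching activity, with ties broken toward the bottom
    (largest index), mirroring BestLoc in Relax_Reschedule."""
    candidates = [(switching(move(schedule, idx, j)), j)
                  for j in range(idx, latest + 1)]
    min_sw = min(sw for sw, _ in candidates)
    best_j = max(j for sw, j in candidates if sw == min_sw)
    return best_j, min_sw

# Moving the first "111" step down groups equal bit patterns together:
sched = ["000", "111", "000", "111"]
pos, sw = best_location(sched, 1, 3)
# pos == 3, sw == 3: switching activity drops from 9 to 3.
```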
Since the schedule generated by EMSA is directly related to the initial schedule, it is very important to have a good initial schedule. Both MSAS and MLMSA greatly reduce switching activities, so we can use the schedules generated by them as the initial schedules of EMSA. For example, using Schedule 2 in Figure 5(c) as the initial schedule and assuming a timing constraint of 6 time units, Schedule 2 and Schedule 3 (Figure 5(d)) are obtained by EMSA. The best schedule is selected based on our energy model in equation 2. When both coefficients in equation 2 are set to 1, Schedule 3 is selected as the final schedule.
5 Experiments
We experimented with our algorithms on a set of benchmarks including a 2-cascaded Biquad filter, a 4-stage lattice filter, an 8-stage lattice filter, a differential equation solver, an elliptic filter, and a Voltera filter. When experimenting with EMSA, the schedules generated by MSAS and MLMSA are used as the initial schedules, and the best results are selected. When computing the switching power, we fix the coefficients of equation 2 and the weights c_i of equation 3 to constant values. The results are compared with those of the traditional list scheduling algorithm. In our experiments, the opcodes for each node are obtained from the TI TMS320C6000 instruction set [24].
The experimental results for MSAS and list scheduling are shown in Table 1. Column "SA" lists the switching activities for list scheduling (field "ListS") and MSAS (field "MSAS") as the number of functional units varies from 3 to 6. Column "SL" lists the schedule length. Column "%" under MSAS lists the percentage reduction in switching activity compared with list scheduling. The average reduction is shown in the last row. Similarly, the experimental results for MLMSA and list scheduling are shown in Table 2.
The experimental results in Table 3 and Table 4 show the comparison on energy for EMSA and list scheduling under two different settings of the power coefficients, respectively. Row "E." lists the energy for the various benchmarks. Row "SL" lists the schedule length obtained by list scheduling. Row "TC" lists the timing constraints used in EMSA. Each benchmark is scheduled with three timing constraints; the first one is the schedule length obtained by list scheduling. Row "%" in field "EMSA" lists the reduction in energy of EMSA compared with list scheduling (field "ListS"). In each table, the results are shown when the number of FUs equals 3, 4, and 5, respectively.
From the experimental results in Tables 1-4, we find that our algorithms reduce energy consumption significantly compared with list scheduling. When switching activity plays the most important role in energy consumption, MSAS gives an average reduction of 41.2% in switching activity compared with list scheduling. When schedule length plays the most important role in energy consumption, MLMSA gives the same schedule length as list scheduling while achieving an average reduction of 33.2% in switching activity at the same time. For the general case, EMSA gives an average reduction of 31.2% in energy compared with list scheduling.
Figure 7. The energy variation of the 8-lattice IIR filter (FUs = 4) for EMSA and list scheduling as the timing constraint increases from 17 to 25.
From the experimental results in Table 3 and Table 4, we can see that the energy of the schedules produced by EMSA decreases correspondingly as the timing constraint increases for each benchmark. To show the energy variation further, we show the results for the 8-lattice IIR filter as the timing constraint varies from 17 to 25 in Figure 7. EMSA gradually reduces the energy as the timing constraint increases from 17 to 23. After that, the energy cannot be reduced any more. This shows that no further energy reduction can be gained once the timing constraint reaches a certain point. At this point, with the timing constraint equal to 23, EMSA gives a reduction of 47.8% compared with list scheduling.
6 Conclusion
In this paper, we presented instruction-level scheduling techniques to minimize the energy consumption of applications executed on VLIW processors. We studied the two most important factors, switching activity and sched-
Bench.     FUs=3                FUs=4                FUs=5                FUs=6
           ListS | MSAS         ListS | MSAS         ListS | MSAS         ListS | MSAS
           SL SA | SL SA  %     SL SA | SL SA  %     SL SA | SL SA  %     SL SA | SL SA  %
2-Cas-Biq  7 54  | 8 27  50.0   6 40  | 7 33  17.5   6 45  | 7 28  37.8   6 45  | 7 28  37.8
4-Lat-IIR  9 64  | 12 35 45.3   9 69  | 11 35 49.3   9 68  | 10 37 45.6   9 62  | 10 34 45.2
8-Lat-IIR  17 96 | 23 43 55.2   17 96 | 22 43 55.2   17 102| 20 43 57.8   17 103| 19 45 56.3
DEQ        5 40  | 6 21  47.5   5 30  | 5 27  10.0   5 35  | 5 27  22.9   5 35  | 5 27  22.9
Elliptic   15 143| 16 90 37.1   14 142| 14 93 34.5   14 142| 14 85 40.1   14 142| 14 85 40.1
Voltera    12 60 | 15 31 48.3   12 57 | 14 29 49.1   12 56 | 13 32 42.9   12 57 | 13 34 40.4
Average Reduction %      47.2                 35.9                 41.2                 40.4
Table 1. The comparison on switching activity and schedule length for MSAS and List Scheduling when the number of FUs varies from 3 to 6.
Bench.     FUs=3                FUs=4                FUs=5                FUs=6
           ListS | MLMSA        ListS | MLMSA        ListS | MLMSA        ListS | MLMSA
           SL SA | SL SA  %     SL SA | SL SA  %     SL SA | SL SA  %     SL SA | SL SA  %
2-Cas-Biq  7 54  | 7 28  48.1   6 40  | 6 29  27.5   6 45  | 6 23  48.9   6 45  | 6 30  33.3
4-Lat-IIR  9 64  | 9 46  28.1   9 69  | 9 48  30.4   9 68  | 9 40  41.2   9 62  | 9 38  38.7
8-Lat-IIR  17 96 | 17 71 26.0   17 96 | 17 73 24.0   17 102| 17 57 44.1   17 103| 17 47 54.4
DEQ        5 40  | 5 31  22.5   5 30  | 5 27  10.0   5 35  | 5 27  22.9   5 35  | 5 27  22.9
Elliptic   15 143| 15 101 29.4  14 142| 14 93 34.5   14 142| 14 85 40.1   14 142| 14 85 40.1
Voltera    12 60 | 12 44 26.7   12 57 | 12 41 28.1   12 56 | 12 37 33.9   12 57 | 12 34 40.4
Average Reduction %      30.1                 25.7                 38.5                 38.3
Table 2. The comparison on switching activity and schedule length for MLMSA and List Scheduling when the number of FUs varies from 3 to 6.
           2-Cas-Biq        4-Lat-IIR        8-Lat-IIR        DEQ             Elliptic           Voltera

FUs = 3
ListS SL   7                9                17               5               15                 12
      E.   61               73               113              45              158                72
EMSA  TC   7 8 9            9 11 14          17 20 24         5 6 7           15 18 21           12 14 17
      E.   35 30 28         55 49 41         88 75 66         36 27 26        116 96 89          56 55 47
      %    42.6 50.8 54.1   24.7 32.9 43.8   22.1 33.6 41.6   20.0 40.0 42.2  26.6 39.2 43.7     22.2 23.6 34.7

FUs = 4
ListS SL   6                9                17               5               14                 12
      E.   46               78               113              35              156                69
EMSA  TC   6 7 8            9 11 14          17 20 24         5 6 7           14 18 21           12 14 17
      E.   35 31 29         57 46 39         90 68 59         32 28 28        107 79 74          53 43 43
      %    23.9 32.6 37.0   26.9 41.0 50.0   20.4 39.8 47.8   8.6 20.0 20.0   31.4 49.4 52.6     23.2 37.7 37.7

FUs = 5
ListS SL   6                9                17               5               14                 12
      E.   51               77               119              40              156                68
EMSA  TC   6 7 8            9 11 14          17 20 24         5 6 7           14 18 21           12 14 17
      E.   29 29 29         49 43 43         74 63 58         32 27 27        99 82 82           49 45 44
      %    43.1 43.1 43.1   36.4 44.2 44.2   37.8 47.1 51.3   20.0 32.5 32.5  36.5 47.4 47.4     27.9 33.8 35.3
Table 3. The comparison on energy for EMSA and List Scheduling (first coefficient setting) when the number of FUs varies from 3 to 5.
           2-Cas-Biq        4-Lat-IIR        8-Lat-IIR        DEQ             Elliptic           Voltera

FUs = 3
ListS SL   7                9                17               5               15                 12
      E.   68               82               130              50              173                84
EMSA  TC   7 8 9            9 11 14          17 20 24         5 6 7           15 18 21           12 14 17
      E.   42 38 37         64 59 54         105 94 89        41 33 32        131 114 108        68 68 61
      %    38.2 44.1 45.6   22.0 28.0 34.1   19.2 27.7 31.5   18.0 34.0 36.0  24.3 34.1 37.6     19.0 19.0 27.4

FUs = 4
ListS SL   6                9                17               5               14                 12
      E.   52               87               130              40              170                81
EMSA  TC   6 7 8            9 11 14          17 20 24         5 6 7           14 18 21           12 14 17
      E.   41 38 37         66 57 51         107 90 82        37 34 34        121 95 92          65 57 57
      %    21.2 26.9 28.8   24.1 34.5 41.4   17.7 30.8 36.9   7.5 15.0 15.0   28.8 44.1 45.9     19.8 29.6 29.6

FUs = 5
ListS SL   6                9                17               5               14                 12
      E.   57               86               136              45              170                80
EMSA  TC   6 7 8            9 11 14          17 20 24         5 6 7           14 18 21           12 14 17
      E.   35 35 35         58 53 53         91 81 79         37 33 33        113 98 98          61 59 59
      %    38.6 38.6 38.6   32.6 38.4 38.4   33.1 40.4 41.9   17.8 26.7 26.7  33.5 42.4 42.4     23.8 26.3 26.3
Table 4. The comparison on energy for EMSA and List Scheduling (second coefficient setting) when the number of FUs varies from 3 to 5.
ule length, that influence the energy consumption of an application on a VLIW processor during scheduling. We proved that the instruction-level energy-minimization scheduling problem is NP-complete with or without resource constraints. We proposed three heuristic algorithms, MSAS, MLMSA, and EMSA, to solve the problem. While switching activity and schedule length are given higher priority in MSAS and MLMSA, respectively, EMSA gives the best result by considering both of them. The experimental results show that our algorithms can significantly reduce energy consumption compared with standard list scheduling.
References
[1] V. Tiwari, S. Malik, and M. Fujita, "Power analysis of embedded software: A first step towards software power minimization," in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, Nov. 1994, pp. 110–115.

[2] N. Chang, K. Kim, and H. G. Lee, "Cycle-accurate energy measurement and characterization with a case study of the ARM7TDMI," IEEE Trans. on VLSI Systems, vol. 10, no. 2, pp. 146–154, Apr. 2002.

[3] J. M. Chang and M. Pedram, "Register allocation and binding for low power," in Proceedings of the IEEE/ACM Design Automation Conference, June 1995, pp. 29–35.

[4] M. T.-C. Lee, V. Tiwari, S. Malik, and M. Fujita, "Power analysis and low-power scheduling techniques for embedded DSP software," in Proceedings of the IEEE International Symposium on System Synthesis, Sept. 1995, pp. 110–115.

[5] C. Lee, J.-K. Lee, and T. Hwang, "Compiler optimization on instruction scheduling for low power," in Proceedings of the IEEE International Symposium on System Synthesis, Sept. 2000, pp. 55–60.

[6] Hongbo Yang, Guang R. Gao, and Clement Leung, "On achieving balanced power consumption in software pipelined loops," in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 2002, pp. 210–217.

[7] A. Raghunathan and N. K. Jha, "An ILP formulation for low power based on minimizing switched capacitance during data path allocation," in Proceedings of the IEEE International Symposium on Circuits and Systems, May 1995, pp. 1069–1073.

[8] L. Kruse, E. Schmidt, G. Jochens, A. Stammermann, A. Schulz, E. Macii, and W. Nebel, "Estimation of lower and upper bounds on the power consumption from scheduled data flow graphs," IEEE Trans. on VLSI Systems, vol. 9, no. 1, pp. 3–14, Feb. 2001.

[9] A. Parikh, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin, "Instruction scheduling based on energy and performance constraints," in IEEE Computer Society Annual Workshop on VLSI, Apr. 2000, pp. 37–42.

[10] C.-L. Su, C.-Y. Tsui, and A. M. Despain, "Saving power in the control path of embedded processors," IEEE Design & Test of Computers, vol. 11, no. 4, pp. 24–30, Winter 1994.

[11] E. Musoll and J. Cortadella, "Scheduling and resource binding for low power," in Proceedings of the IEEE International Symposium on System Synthesis, Apr. 1995, pp. 104–109.

[12] M. Mehendale, S. D. Sherlekar, and G. Venkatesh, "Coefficient optimization for low power realization of FIR filters," in IEEE Workshop on VLSI Signal Processing, 1995, pp. 352–361.

[13] K. Masselos, S. Theoharis, P. K. Merakos, T. Stouraitis, and C. E. Goutis, "Low power synthesis of sum-of-products computation," in Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, July 2000, pp. 234–237.

[14] K. W. Choi and A. Chatterjee, "Efficient instruction-level optimization methodology for low-power embedded systems," in Proceedings of the IEEE International Symposium on System Synthesis, Oct. 2001, pp. 147–152.

[15] J. A. A. van der Veen, G. J. Woeginger, and S. Zhang, "Sequencing jobs that require common resources on a single machine: a solvable case of the TSP," Mathematical Programming, pp. 235–254, 1998.

[16] E. Macii, M. Pedram, and F. Somenzi, "High-level power modeling, estimation and optimization," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, pp. 1061–1079, November 1998.

[17] L. Benini, D. Bruni, M. Chinosi, C. Silvano, V. Zaccaria, and R. Zafalon, "A power modeling and estimation framework for VLIW-based embedded systems," in Proceedings of the International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS'01, 2001, pp. 26–28.

[18] A. Bona, M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, and R. Zafalon, "Energy estimation and optimization of embedded VLIW processors based on instruction clustering," in Proceedings of the IEEE/ACM Design Automation Conference, 2002, pp. 886–891.

[19] M. Sami, D. Sciuto, C. Silvano, and V. Zaccaria, "Instruction level power estimation for embedded VLIW cores," in Proceedings of the International Workshop on Hardware/Software Codesign, May 2000, pp. 34–38.

[20] M. Sami, D. Sciuto, C. Silvano, and V. Zaccaria, "Power exploration for embedded VLIW architecture," in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, Nov. 2000, pp. 498–503.

[21] Texas Instruments, Inc., TMS320C6000 Peripherals Reference Guide (Rev. D), 2001.

[22] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, 1979.

[23] H. A. B. Saip and C. L. Lucchesi, "Matching algorithms for bipartite graphs," Tech. Rep. DCC-93-03, Departamento de Ciência da Computação, Universidade Estadual de Campinas, March 1994, http://www.dee.unicamp.br/ic tr ftp/ALL/Abstrace.html.

[24] Texas Instruments, Inc., TMS320C6000 CPU and Instruction Set Reference Guide, 2000.
Robust Transmission of Wavelet Video Sequence Using Intra Prediction Coding Scheme
Joo-Kyong Lee, Ki-Dong Chung
Department of Computer Engineering, Pusan National University
[email protected], [email protected]
Abstract
A wavelet-based video stream is more susceptible to network transmission errors than DCT-based video. This is because bit errors in a subband of a video frame affect not only the other subbands within the current frame but also the subsequent frames. In this paper, we propose a video source coding scheme called Intra Prediction Coding (IPC) in which each subband refers to the 1-level lower subband of the same orientation within the same frame, in order to reduce error propagation to the subsequent frames. We evaluated the performance of the proposed scheme in a simulated wireless network environment. The tests show that IPC outperforms MRME on heavy-motion image sequences, while on low-motion image sequences IPC outperforms MRME at high bit rates.
Keywords
Intra Prediction Coding, Wavelet-based video, Error Propagation, Error Resilience, MRME
1. Introduction
As personal mobile devices become more and more popular, the network environment for transmitting multimedia data is moving from the current wired networks to wireless networks. There has been a lot of research in video coding to overcome the limited resources of wireless networks, and some of it has been published as part of DCT-based international standards such as H.263 [1] and MPEG-4 [2]. Besides, several wavelet-transform-based image codecs [3] and video codecs [4][5][6] have been studied, since they are efficient at low bit rates.
Most DCT-based standard video encoders perform motion-compensated prediction to reduce the temporal redundancy between the current frame and the reference frame. However, this scheme causes error propagation when a frame is corrupted during transmission over the network. To mitigate this problem, the H.263 and MPEG-4 standards include various error control schemes [7][8] as annexes. The error propagation problem in wavelet-based video is more serious than in DCT-based video. In Fig. 1, (a) shows an original image S of size M×N pixels and (b) shows a 3-level wavelet decomposition of the image S. In (b), S is decomposed into 10 subbands in total: in the first level of decomposition, one low pass subband and three orientation high pass subbands (LH1, HL1, HH1) are created. In the second level of decomposition, the low pass subband is further decomposed into one low pass and three
high pass subbands (LH2, HL2, HH2). This process is repeated on the low pass subband to form higher level wavelet decompositions. The inverse wavelet transform is calculated in the reverse manner to reconstruct the original image. However, if the higher level subbands of the transformed image are corrupted during transmission over the network, the effect of the errors propagates to the lower level subbands and to the subsequent frames of the motion-compensated video.
[Figure 1 residue: the original image S of horizontal size M and vertical size N, and its 3-level decomposition into the subbands LL3, LH1-LH3, HL1-HL3, HH1-HH3, with level-k subbands of size M/2^k × N/2^k.]

Fig. 1 (a) Original image before the wavelet transform and (b) the 3-level wavelet transformed image
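One decomposition level of the kind shown in Fig. 1 can be sketched with the Haar filter, chosen here for brevity; the paper does not specify which wavelet it uses, so treat this as an illustration of the subband layout rather than the paper's codec.

```python
def haar_1d(row):
    """One level of the 1-D Haar transform: pairwise averages (low pass)
    followed by pairwise differences (high pass), each scaled by 1/2."""
    low = [(row[2 * i] + row[2 * i + 1]) / 2 for i in range(len(row) // 2)]
    high = [(row[2 * i] - row[2 * i + 1]) / 2 for i in range(len(row) // 2)]
    return low + high

def haar_2d_level(img):
    """One 2-D decomposition level: transform rows, then columns,
    yielding LL, HL, LH and HH subbands in the four quadrants."""
    rows = [haar_1d(r) for r in img]
    cols = [haar_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

img = [[1, 1, 2, 2],
       [1, 1, 2, 2],
       [3, 3, 4, 4],
       [3, 3, 4, 4]]
out = haar_2d_level(img)
# The top-left 2x2 quadrant is the LL subband of local averages,
# [[1.0, 2.0], [3.0, 4.0]]; this piecewise-constant image has no detail,
# so the LH/HL/HH quadrants are all zero.
```

Repeating `haar_2d_level` on the LL quadrant produces the second and third decomposition levels of Fig. 1(b).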
So far, there has been some research on error control for wavelet-based images and video, but it is limited to the SPIHT (Set Partitioning In Hierarchical Trees) scheme [9][10]; error control for motion-compensated wavelet-based video has not been studied. In this paper, we propose a source coding scheme called Intra Prediction Coding to reduce the error propagation of motion-compensated wavelet-based video. In the proposed scheme, each subband estimates motion vectors from the 1-level lower subband of the same orientation within the same frame. However, the LLM 1) subband and the first level subbands (LH1, HL1, HH1) are exempt from this rule. That is to say, the LLM subband performs motion estimation from the LLM subband of the reference frame to exploit the high similarity between the two subbands, and the first level subbands do not perform motion estimation from any other subbands, considering coding efficiency and computational complexity. In the proposed scheme, if the LLM subband is protected from transmission errors, then even if the other subbands are corrupted, the subsequent frames are free from error propagation.
This paper is organized as follows. In the next section, we explain the MRME (Multi-Resolution Motion
Estimation) scheme, which performs motion compensated prediction from a previous frame. In Section 3,
we describe the detailed procedure of IPC. We then present the results of various simulations in an
error-prone environment. We conclude the paper with discussion in the last section.
1) LLM is the low pass subband of the M-level wavelet transformed image.
2. MRME (Multi-Resolution Motion Estimation)
MRME is a motion estimation scheme first proposed by Zhang and Zafar [4] as an alternative to the
conventional block-matching and subband-to-subband prediction schemes, aimed at reducing the
computational load [11]. MRME estimates motion vectors hierarchically from the higher level
subbands to the lower level subbands and uses the motion information of the highest level subband(s) to
predict that of the lower subbands, thereby reducing the computation load. Several variations are possible,
depending on the choice of the highest-level subbands used as a basis and on the refinement process;
Zhang and Zafar implemented four of them. In this paper, we use scheme III, in which MVs (Motion
Vectors) in the LLM subband are calculated and properly scaled as the initial biases for refining the MVs
in all other subbands, to describe the algorithm and to compare with our proposed scheme.
[Figure 2: a block in the LHM-1 subband is predicted from the LLM subband; MV_{M,5} is scaled to
2×MV_{M,5} and then refined by ∆ within a reduced search area to obtain MV_{M-1,5}. Notation:
2^{M-m}·p × 2^{M-m}·q is the block size of an m-th level subband, m is the wavelet transform level,
M is the highest wavelet transform level, and p × q is the block size of the LLM subband.]
Fig. 2 Motion estimation of the lower level subband using MVs of the LLM subband
In the MRME scheme, motion estimation consists of three major phases. First, every block of the LLM
subband estimates its motion vector from the LLM subband of the reference frame; the estimation method
is the same as in the conventional block-matching motion estimation scheme. Second, the MVs
calculated in the first phase are scaled in proportion to the size of the lower level subbands. Last,
the scaled MVs are refined through motion estimation within a reduced search area. Fig. 2 shows an
example of the MRME scheme. In the figure, to predict MV_{M-1,5}, MV_{M,5} in the LLM subband is expanded to
2×MV_{M,5}, and the incremental motion vector δ(x, y) is calculated within a reduced search area centered at
2×MV_{M,5}. Eq. (1) describes the calculation of all the MVs except those of the LLM subband, where i is the
wavelet transform level, j is the location of the subband, and the (x, y) coordinate is the block index in the
LLM subband.
MV_{i,j}(x, y) = 2^{M-i} · MV_{M,0}(x, y) + δ(x, y)        (1)
        for i = 0, 1, 2, ..., M and j = 1, 2, 3
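The scale-then-refine idea of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are ours, and the cost callback stands in for whatever block-matching criterion the refinement uses.

```python
def scaled_initial_mv(mv_LLM, i, M):
    """Scale an LL_M motion vector into the initial bias for level i.

    A level-i subband is 2**(M - i) times larger per dimension than LL_M,
    so the vector components are multiplied by that factor (Eq. 1)."""
    f = 2 ** (M - i)
    return (f * mv_LLM[0], f * mv_LLM[1])

def refine_mv(init_mv, cost, r=1):
    """Search the incremental vector delta within a reduced window of radius r
    around the scaled initial vector, returning the refined MV."""
    best, best_c = init_mv, cost(init_mv)
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            cand = (init_mv[0] + dx, init_mv[1] + dy)
            c = cost(cand)
            if c < best_c:
                best, best_c = cand, c
    return best
```

Because the refinement window radius r is much smaller than a full search range, the hierarchical scheme trades a small loss in optimality for a large reduction in the number of candidate positions evaluated.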
3. Intra Prediction Coding
Fig. 3 shows how IPC works on 3-level wavelet transformed video frames. The LL3 subband of
a video frame contains a large percentage of the total energy even though it is only 1/64 of the original
frame size. It is also similar to the LL3 of the previous frame, since this subband carries the coarse contour
of its frame. Therefore, for coding efficiency, we predict the MVs of LL3 of the current frame from LL3 of
the previous frame. All other subbands except the first level subbands perform motion compensated
prediction from the 1-level lower subbands in the same orientation within the same frame.
In wavelet-based video, there can be a large difference between the lower level subband coefficients of a
shifted image and those of the original image; this phenomenon occurs frequently around image
edges [14]. For this reason, the first level subbands are exempt from motion estimation from a
previous frame.
[Figure 3: frames F(t-1) and F(t); arrows show the reference direction. LL3 of F(t), the LL subband at the
highest transform level, refers to LL3 of F(t-1); LH3/HL3/HH3 of F(t) refer to LH2/HL2/HH2 within F(t);
LH2/HL2/HH2 refer to LH1/HL1/HH1; the first level subbands are used only as references for motion
estimated prediction.]
Fig. 3 Example of IPC on the 3-level transformed frames
3.1. Motion Estimation for the LLM subband
As mentioned previously, the LLM subband is more important than any other subband in a wavelet
transformed frame and contains very large coefficient values. The motion estimation process for this
subband is as follows. First, the subband is divided into blocks of the same size px×py, except at the
boundary. Next, for every block, Eq. (2) is applied to calculate the similarity between the current block
and a candidate block of the reference frame. In the equation, the cost function, the Sum of Absolute
Differences (SAD), is defined to measure the difference between the current block and the
candidate blocks. Let a pixel of the current block in the current frame be denoted as LL_M^t(x+k, y+l), where
(x, y) is the coordinate of the top left corner of the current block and (k, l) is the relative location from (x, y)
in the block. Let a pixel of the LL subband in the reference frame be LL_M^{t-1}(x+k+i, y+l+j), where
-p ≤ i, j ≤ p and p is the search range. More specifically, Eq. (2) sums up all the absolute differences
between the corresponding pixels of the current block and the candidate block located (i, j) away from the
block in the reference frame. From Eq. (2), we obtain a number of SAD values according to the size of the
search range p, and select the smallest value; its relative location from the current block becomes the MV
of the current block.
SAD(i, j) = Σ_{k=0}^{px-1} Σ_{l=0}^{py-1} | LL_M^t(x+k, y+l) - LL_M^{t-1}(x+k+i, y+l+j) |        (2)
MV(x, y) = arg min_{(i,j)} SAD(i, j),   -p ≤ i, j ≤ p
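A direct transcription of this full-search procedure might look like the following Python/NumPy sketch. The function name is ours, and we adopt the convention that x indexes columns and y indexes rows; candidates that fall outside the subband are simply skipped, one possible way of handling the boundary.

```python
import numpy as np

def full_search_mv(cur, ref, x, y, px, py, p):
    """Full-search block matching for an LL_M block (the Eq. 2 procedure).

    cur/ref: current and reference LL_M subbands (2-D arrays, row = y, col = x).
    (x, y): top-left corner of the px-by-py current block; p: search range.
    Returns the displacement (i, j) with minimal SAD."""
    block = cur[y:y + py, x:x + px].astype(np.int64)
    best, best_sad = (0, 0), None
    for j in range(-p, p + 1):
        for i in range(-p, p + 1):
            xs, ys = x + i, y + j
            if xs < 0 or ys < 0 or xs + px > ref.shape[1] or ys + py > ref.shape[0]:
                continue  # candidate block falls outside the subband
            sad = np.abs(block - ref[ys:ys + py, xs:xs + px].astype(np.int64)).sum()
            if best_sad is None or sad < best_sad:
                best, best_sad = (i, j), sad
    return best
```

The cost of this exhaustive scan over a (2p+1)² window is exactly what MRME's hierarchical refinement amortizes: the full search runs only on the small LL_M subband, and every other subband searches a reduced window.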
3.2. Motion Estimation for Other Subbands
After the motion estimation of the LLM subband, IPC calculates the MVs of the same level subbands,
LHM, HLM, HHM, and exploits these MVs to predict the MVs of the 1-level lower subbands. The motion
estimation procedure for the highest level subbands is shown in Eq. (3). There are three differences
between IPC and the conventional motion estimation scheme. First, all the subbands except LLM and the
first level subbands refer to the 1-level lower subbands in the same orientation within the same frame; for
instance, the subband s of level M refers to the subband s of level M-1. Second, because the coefficients
of the current subbands are larger than those of the reference subbands, a constant K is needed to
compensate for the difference. Third, the width and height of the reference subbands are double those of
the current subbands, so the coordinate of the top left corner of the reference subband is (2x, 2y) in
Eq. (3).
SAD_{M,s}(i, j) = Σ_{k=0}^{px-1} Σ_{l=0}^{py-1} | S_{M,s}(x+k, y+l) - K · S_{M-1,s}(2x+k+i, 2y+l+j) |        (3)
MV_{M,s}(x, y) = arg min_{(i,j)} SAD_{M,s}(i, j),   -p ≤ i, j ≤ p,  1 ≤ s ≤ 3
Except for the highest and the lowest level subbands, all subbands use the MVs of the 1-level higher
subbands in the same orientation, as in Eq. (4). For instance, the MVs of the level m subbands are
calculated by scaling the MVs of the level m+1 subbands, and then the incremental motion vector
∆(δx, δy) is calculated, in the same manner as in Eq. (1), within a reduced search area. Note that this
reduced search area is part of the 1-level lower subband. This process is repeated from the higher level
subbands down to the lower level subbands.
MV_{m,s}(x, y) = 2 · MV_{m+1,s}(x, y) + ∆(δx, δy)        (4)
        for 2 ≤ m ≤ M-1
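The cross-level search of Eq. (3), with its K-scaled reference and doubled coordinates, can be sketched in Python/NumPy as follows. The function names are ours, not the paper's; the same search driver, seeded with a scaled MV as in Eq. (4), would serve the refinement step for intermediate levels.

```python
import numpy as np

def intra_sad(S_m, S_m1, x, y, px, py, i, j, K=3):
    """SAD between a level-m block and a K-scaled candidate in the 1-level
    lower subband S_m1 (Eq. 3). Reference coordinates start at (2x, 2y)
    because the lower-level subband is twice as large per dimension."""
    cur = S_m[y:y + py, x:x + px].astype(np.float64)
    ref = K * S_m1[2 * y + j:2 * y + j + py, 2 * x + i:2 * x + i + px].astype(np.float64)
    return float(np.abs(cur - ref).sum())

def intra_mv(S_m, S_m1, x, y, px, py, p, K=3):
    """Arg-min over the (2p+1)^2 candidate displacements, skipping candidates
    that fall outside the reference subband."""
    best, best_sad = None, None
    for j in range(-p, p + 1):
        for i in range(-p, p + 1):
            ys, xs = 2 * y + j, 2 * x + i
            if ys < 0 or xs < 0 or ys + py > S_m1.shape[0] or xs + px > S_m1.shape[1]:
                continue
            sad = intra_sad(S_m, S_m1, x, y, px, py, i, j, K)
            if best_sad is None or sad < best_sad:
                best, best_sad = (i, j), sad
    return best
```

Note that, unlike Eq. (2), both the current block and the candidate come from the same frame, which is what confines any transmission error to a single frame.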
3.3. Motion Compensation
IPC performs motion compensation for the LLM subband in the same way as the conventional
block-matching scheme and uses Eq. (5) for the other subbands. In the equation, R_{m,s} denotes the
motion compensated residual signal of the subband s at level m: the difference between the blocks of
subband s at level m and the best matching blocks in the 1-level lower subband.
R_{m,s} = S_{m,s} - K · S_{m-1,s}(MV_{m,s})        (5)
        for 2 ≤ m ≤ M,  1 ≤ s ≤ 3
Fig. 4 Encoding algorithm for Intra Prediction Coding scheme
4. Implementation and Simulation Results
4.1. Implementation
To evaluate the performance of our proposed scheme, we modified Geoff Davis's wavelet image
coder [12] into a motion compensated wavelet video coder. In the implemented video coder, each frame
except the first is encoded in either MRME or IPC mode according to its similarity to the previous frame.
In other words, the first frame is encoded as an I frame and the subsequent frames are encoded as motion
compensated frames, using the MAD (Mean Absolute Difference), a measure of the difference between
Intra_Inter_ME( )
/* Input parameters:  c_frame : the wavelet transformed current frame
                      prev_frame : the reference frame
                      max_level : the highest DWT level
                      block_numbers : the number of blocks in a subband
   Output parameters: mc_frame : motion compensated frame
                      MV : the set of calculated MVs */
M  = max_level ;      /* The highest DWT level */
BN = block_numbers ;  /* The number of blocks in a subband */
For (i = 0; i < BN; i++)                  /* ME/MC for the LLM subband */
    MV[M][0][i] = Do_LL_MEMC(i, c_frame, prev_frame) ;    /* use of Eq. (2) */
For (j = 1; j <= 3; j++)                  /* for LHM, HLM, HHM subbands */
    For (i = 0; i < BN; i++)              /* perform motion estimation */
        MV[M][j][i] = Intra_pred(M, j, i, c_frame) ;      /* use of Eq. (3) */
For (k = M-1; k > 1; k--)                 /* for lower level (2 ~ M-1) subbands */
    For (j = 1; j <= 3; j++)
        For (i = 0; i < BN; i++)
            MV[k][j][i] = Intra_pred(MV, k, j, i, c_frame) ; /* MVs via Eq. (4) */
mc_frame = Do_MC(MV, c_frame, prev_frame) ; /* Motion compensation for all
                                               subbands except LLM : Eq. (5) */
the two objects. In Fig. 5, the compression mode of the current frame is determined after motion
compensation of the LLM subband: if the MAD of the two LLM subbands is greater than a given
threshold T, the encoder operates in the IPC mode; otherwise it operates in the MRME mode.
Because MAD is the average difference between the coefficients of the neighboring LLM
subbands, MAD becomes higher as the motion in a video sequence changes more rapidly. Therefore,
considering coding efficiency and the error propagation rate, it is reasonable to use the IPC scheme for a
heavy motion video sequence. Eq. (6) defines the MAD between the neighboring LLM subbands,
where H and V are the horizontal and vertical sizes of the LLM subband and LL_M^t(i, j) is a coefficient
value at coordinate (i, j) in the LLM subband of frame f(t).
MAD(LL_M^t) = (1 / (H·V)) Σ_{i=0}^{H-1} Σ_{j=0}^{V-1} | LL_M^t(i, j) - LL_M^{t-1}(i, j) |        (6)
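The MAD test and the resulting mode decision can be sketched as follows (Python/NumPy; the function names are ours). We follow the paper's rationale that a large MAD, indicating heavy motion, favours IPC, since IPC confines references to the current frame, while low motion favours MRME's more efficient inter-frame prediction.

```python
import numpy as np

def mad(LL_cur, LL_prev):
    """Mean absolute difference between neighbouring LL_M subbands (Eq. 6)."""
    return float(np.abs(LL_cur.astype(np.float64) - LL_prev.astype(np.float64)).mean())

def choose_mode(LL_cur, LL_prev, T):
    """Per-frame mode decision: MAD above threshold T (heavy motion) selects
    IPC; low MAD (similar neighbouring frames) selects MRME."""
    return "IPC" if mad(LL_cur, LL_prev) > T else "MRME"
```

Lowering T therefore raises the fraction of IPC-coded frames, which matches the relationship between T and the IPC ratio reported in the experiments below.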
After motion compensation, the implemented video coder performs quantization and VLC (Variable
Length Coding) as employed in Geoff Davis's wavelet image coder, as shown in Fig. 5.
[Figure 5: block diagram of the implemented video coder. The input video is wavelet transformed (DWT);
a coding control compares the LL subband MAD against the threshold to select either LLME followed by
IPC, or MRME; motion compensated residuals are quantized (Q) and variable length coded (VLC) into the
bit stream, with inverse quantization (Q^-1) and a frame memory in the reconstruction loop; MVs are
multiplexed into the bit stream. Legend: DWT = Discrete Wavelet Transform, LLME = Motion Estimation
for the LL subband, MC = Motion Compensation, IPC = Intra Prediction Coding, MRME = Multi-Resolution
Motion Estimation, Q = Quantization, VLC = Variable Length Coding, MVs = Motion Vectors.]
Fig. 5 Block diagram for the implemented video coder
4.2. Simulation Results
The proposed scheme was tested on video sequences with different motion rates. Table 1 lists the video
sequence and wavelet transform parameters used in the experiments. As previously stated, there is a large
difference in coefficient values between the higher and lower level subbands, which must be compensated
by the constant K in Eq. (3). To select a proper value, we compared the MADs of various video sequences
while varying the constant, and concluded that the integer 3 is the best value, even though there are small
differences among the sequences. The tested video sequences are standard QCIF (176×144) sequences,
whose original frame rate is 30 frames/s; the frame rate was reduced to 10 frames/s to demonstrate
performance at low bit-rates. We assume that the first four frames are protected from errors, to allow the
decoder to stabilize into a steady state before errors are introduced.
<Table 1> The parameters for simulation

  Parameters                       Description
  Format of tested video           QCIF (176×144)
  Video sequences  File name       Claire.qcif    Foreman.qcif
                   Motion rate     Low            High
  Frame rate       Original video  20 fps
                   Coded video     10 fps
  Maximum DWT level                3
  Filters for wavelet transform    Antonini 7/9
  Complement constant K            3
In the simulations, we used the average luminance PSNR (Peak Signal-to-Noise Ratio) as the video
quality measure. PSNR is defined as 10·log10(255²/σe²), where σe² is the mean-square error between the
original image and the processed image. For experimental convenience, we assume that a single block
row in a subband forms a synchronization group, so that corruption of one block corrupts the other blocks
in the same group, similar to a GOB (Group of Blocks) in the H.263 standard. To hide corruption, the
decoder employs a simple error concealment technique, previous frame concealment, in which the
corrupted blocks are replaced by the corresponding pixels of the reference frame.
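The quality measure and the concealment step can both be sketched in Python/NumPy (the function names are ours; corrupted regions are indexed in units of whole block rows, matching the synchronization-group assumption above):

```python
import numpy as np

def psnr(orig, proc):
    """Luminance PSNR in dB: 10 * log10(255^2 / MSE)."""
    mse = np.mean((orig.astype(np.float64) - proc.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def conceal_rows(frame, ref_frame, corrupted_rows, block_h):
    """Previous-frame concealment: replace every corrupted block row (one
    synchronization group) with the co-located pixels of the reference frame."""
    out = frame.copy()
    for r in corrupted_rows:
        out[r * block_h:(r + 1) * block_h, :] = ref_frame[r * block_h:(r + 1) * block_h, :]
    return out
```

Such concealment works well when neighbouring frames are similar (low motion) but degrades under heavy motion, a behaviour the experiments below exhibit.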
[Figure 6: two states G (Good) and B (Bad); transition G→B with probability p, B→G with probability q;
self-transitions with probabilities 1-p and 1-q.]
Fig. 6 A two state Markov channel
In the experiments, Gilbert's error model [13] and WCDMA network errors are employed to investigate
the performance of the proposed algorithm in a wireless network environment. Gilbert's error model
describes the error characteristics of a wireless channel with two states, as shown in Fig. 6. It is a
two-state Markov chain with states named Good and Bad. Each state is assigned a specific
constant bit error rate (BER): eG in the good state and eB in the bad state (eG ≪ eB). The state transition
probability p is the probability that the next state is the bad state, given that the current state is good,
and q is the probability of the reverse transition. In the simulation, we fixed eG to 0.0001972644427729, eB
to 0.1 and q to 0.00078125, while we varied p from 10^-4 to 10^-1 to simulate various network states.
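A minimal simulation of this channel, assuming the usual Gilbert-Elliott semantics (a per-bit error draw in the current state followed by a state transition), might look like the following sketch; the function name and parameter order are ours.

```python
import random

def gilbert_errors(n_bits, p, q, eG, eB, seed=0):
    """Two-state Gilbert-Elliott channel: in state Good bits flip with
    probability eG, in state Bad with eB; Good->Bad with probability p,
    Bad->Good with q. Starts in Good; returns the list of error positions."""
    rng = random.Random(seed)
    state_bad = False
    errs = []
    for i in range(n_bits):
        if rng.random() < (eB if state_bad else eG):
            errs.append(i)  # this bit is corrupted
        if state_bad:
            state_bad = rng.random() >= q  # stay Bad unless recovery fires
        else:
            state_bad = rng.random() < p   # fall into the burst state
    return errs
```

With the parameter values quoted above (tiny eG, large eB, small q), errors arrive in bursts whose frequency grows with p, which is exactly the knob varied in the experiments.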
First, we evaluated the performance of the two schemes, IPC and MRME, at bit-error rates ranging
from 10^-4 to 10^-1, as shown in Fig. 7 and Fig. 8. In the figures, T is the threshold that determines the
motion estimation mode, IPC or MRME; as T decreases, the IPC ratio in a video sequence increases.
Fig. 7 shows the average PSNR performance on the Foreman video sequence at IPC ratios of 100%, 90%,
55%, 5% and 0%, obtained with T = 2, 5, 7, 10, and 15, respectively. As the IPC ratio approaches 0%, the
MRME ratio rises to 100%. One might expect the output bit-rates to differ with the IPC ratio; however,
the experiments showed that they are almost the same on a heavy motion video sequence such as Foreman.
In Fig. 7, the compressed bit-rate is about 424 Kbits/sec. Fig. 7 shows that as the proportion of IPC
increases, the PSNR performance becomes higher regardless of the BER. This is because IPC suppresses
error propagation to the subsequent frames by referring to the same frame.
[Figure 7: PSNR (dB) versus BER for T = 2, 5, 7, 10, 15; PSNR ranges from about 24.5 to 29 dB as the
BER varies from 10^-4 to 10^-1.]
Fig. 7 Comparison of the IPC and MRME on Foreman sequence at different bit-error rates
In Fig. 8, we tested the Claire video sequence, a low motion video. In this experiment, the
thresholds are smaller than those for Foreman, because the neighboring frames are very similar to
each other. The Claire video sequence is encoded at IPC ratios of 100%, 75%, 30%, 10% and 0% with
T = 0.5, 1, 1.5, 2, and 10, respectively. Contrary to the result for Foreman, the output bit-rate increases in
proportion to the IPC ratio on the Claire video sequence: the bit-rate for the proposed
scheme is 599 Kbits/sec, while that for pure MRME is 573 Kbits/sec, so the proposed scheme requires
about 4% more bit-rate. This is due to the high coding efficiency of MRME, which performs motion
estimation from a very similar previous frame. In the figure, as the proportion of IPC increases, the PSNR
performance becomes higher, similarly to the result on Foreman, and this trend becomes more prominent
as the bit-error rate increases. The reason is that even a simple error concealment scheme is effective at a
low bit-error rate on a low motion video such as Claire, but as the bit-error rate increases, the error
concealment has little effect.
[Figure 8: PSNR (dB) versus BER for T = 0.5, 1, 1.5, 2, 10; PSNR ranges from about 36 to 44 dB as the
BER varies from 10^-4 to 10^-1.]
Fig. 8 Comparison of the IPC and MRME on Claire sequence at different bit-error rates
[Figure 9: PSNR (dB) versus frame number (0-60) for T = 2, 5, 7, 10, 15; PSNR ranges from about 22 to
36 dB.]
Fig. 9 Comparison of PSNR on the Foreman video sequence from frame 1 to frame 60
[Figure 10: PSNR (dB) versus frame number (0-60) for T = 0.5, 1, 1.5, 2, 10; PSNR ranges from about 34
to 50 dB.]
Fig. 10 Comparison of PSNR on the Claire video sequence from frame 1 to frame 60
Fig. 9 shows the per-frame PSNR performance at a BER of 10^-3 on the Foreman video sequence. From
the 2nd to the 12th frame, pure MRME shows higher PSNR than the other configurations, because there is
hardly any motion change in the initial period of Foreman. From the 13th frame, however, the situation
reverses: the curves for T=2 and T=5 show high PSNR, while the curve for pure MRME declines slowly
over time. This means that accumulated distortion sharply decreases the performance of MRME as time
passes, since it always refers to a previous frame. In contrast, the IPC curves for T=2 and T=5 rise and
fall independently of the passage of time, owing to the characteristic of IPC that it refers to the current
frame rather than the previous frame.
Fig. 10 shows the results of the experiment on the Claire video sequence under the same conditions as
Fig. 9. The results are very similar to those of Fig. 9 on the whole, except that the curves for T=0.5 and
T=1 differ from each other.
The reconstructed 51st frames of Foreman at various threshold values are shown in Fig. 11. As the IPC
ratio goes up, the frames look clearer and smoother; conversely, at a high MRME ratio, the frames look
rough and noisy. This result illustrates the effect of errors on the lower level subbands. In the case of
MRME, when a bit error occurs in a subband, the error propagates not only to the lower level subbands
within the same frame but also to the same and lower level subbands of the subsequent frames, which
makes the reconstructed image look rough and noisy. In the case of IPC, however, since the lower level
subbands except the LLM subband are not influenced by the errors of the previous frame, the
reconstructed image looks clear and smooth.
Fig. 11 Reconstructed real image at different threshold values on Foreman sequence
Fig. 12 Reconstructed real image at different threshold values on Claire sequence
Fig. 12 shows the reconstructed 51st frames of Claire under the same conditions as Fig. 10. In the figure,
the visual quality of the IPC coded image also looks better than that of MRME, especially around the
woman's face. This is because the probability of error propagation is lower for IPC than for MRME, and
because MRME cannot properly recover errors on the moving face with a simple error concealment
scheme.
In the next experiment, we investigate the performance of the reconstructed images under a simulated
CDMA network environment. The most important simulation parameters used to produce the error files
are listed below.
Table 2 The simulation parameters for the CDMA network

  Parameters          Values
  Bit-rate            200 Kbps ~ 1.7 Mbps
  Carrier frequency   1.9 GHz
  Doppler frequency   5.3, 70, 211 Hz
  BER                 10^-4, 10^-3
  Channel type        Vehicular A
Fig. 13 compares the average PSNR of Foreman at different threshold values when the Doppler frequency
is 70 Hz and the BER is fixed to 10^-4 and 10^-3, respectively. The simulation results show that IPC
achieves higher PSNR than MRME regardless of the BER. Moreover, at low bit-rates, the PSNR for T=5
and T=7, where the coded video sequence mixes the IPC and MRME schemes, outperforms that of the
pure MRME scheme. The simulation results for Doppler frequencies of 5.3 and 211 Hz are very similar.
[Figure 13: two panels of PSNR (dB) versus bit-rate (0-1800 Kbps) for T = 2, 5, 7, 10, 15; left panel
BER = 10^-4 (PSNR about 22-44 dB), right panel BER = 10^-3 (PSNR about 23-37 dB).]
Fig. 13 The comparison of PSNR of Foreman image sequence in the CDMA network environment
For the Claire video sequence, Fig. 14 shows more complex PSNR results: the PSNR of IPC decreases as
the bit-rate becomes lower, while that of MRME changes little. It can be inferred that the quantization
scheme used in Geoff Davis's image coder affects the performance of the proposed scheme. That is, the
lower level subbands serve as references for the higher level subbands in the proposed scheme, but the
quantization scheme quantizes the lower level subbands with a larger step size. Therefore, as the
quantization level goes up, the lower level subbands are quantized more coarsely, which sharply
decreases the PSNR of IPC, particularly on a low motion video such as Claire.
[Figure 14: two panels of PSNR (dB) versus bit-rate (250-700 Kbps) for T = 0.5, 1, 1.5, 2, 10; left panel
BER = 10^-4, right panel BER = 10^-3; PSNR about 36-46 dB.]
Fig. 14 The comparison of PSNR of Claire video sequence in the CDMA network environment
5. Conclusion
In this paper, we proposed a source coding scheme called Intra Prediction Coding for wavelet
based video to control error propagation. In the proposed scheme, only the LLM subband of the current
frame estimates MVs from the reference frame; the other subbands search for MVs in the 1-level lower
subbands within the same frame. Therefore, if the LLM subband is preserved from transmission errors,
the errors are prevented from propagating to the subsequent frames. We implemented a wavelet video
coder by modifying Geoff Davis's image coder and tested the performance of the proposed scheme using
Gilbert's model and a CDMA network model. Extensive computer simulations demonstrated that the
proposed scheme performs better than MRME on heavy motion frame sequences, while IPC also
outperforms MRME at high bit-rates on low motion frame sequences. It can be inferred that the
quantization scheme used affects the proposed scheme, so we will design a new quantization scheme for
the proposed video coder to improve its coding efficiency in the near future.
6. References
[1] "Video coding for low bitrate communication," ITU-T Recommendation H.263, International
Telecommunications Union, Geneva, Switzerland, 1998.
[2] R. Koenen, "Overview of the MPEG-4 standard," ISO/IEC JTC1/SC29/WG11 N4668,
http://mpeg.telecomitalialab.com/standards/mpeg-4/mpeg-4.htm
[3] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 still image coding system: An
overview," IEEE Trans. Consumer Electronics, Vol. 46, No. 4, pp. 1103-1127, Nov. 2000.
[4] Y. Zhang and S. Zafar, "Motion-compensated wavelet transform coding for color video compression,"
IEEE Trans. Circuits Syst. Video Technol., Vol. 2, pp. 285-296, Sept. 1992.
[5] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans.
Signal Processing, Vol. 41, No. 12, pp. 3445-3462, Dec. 1993.
[6] A. Said and W. A. Pearlman, "A new, fast, and efficient image codec based on set partitioning
in hierarchical trees," IEEE Trans. Circuits Syst. Video Technol., Vol. 6, No. 3, pp. 243-250, June 1996.
[7] S. Wenger, G. Knorr, J. Ott, and F. Kossentini, "Error resilience support in H.263+," IEEE Trans.
Circuits Syst. Video Technol., Vol. 8, No. 6, pp. 867-877, Nov. 1998.
[8] Y. Takashima, M. Wada, and H. Murakami, "Reversible variable length codes," IEEE Trans.
Communications, Vol. 43, pp. 158-162, Feb./Mar./Apr. 1995.
[9] A. A. Alatan and M. Zhao, "Unequal error protection of SPIHT encoded image bitstreams," IEEE
JSAC, Vol. 18, No. 6, pp. 814-818, June 2000.
[10] S. Cho and W. A. Pearlman, "Error resilient video coding with improved 3-D SPIHT and error
concealment," SPIE/IS&T Electronic Imaging 2003, Proceedings SPIE Vol. 5022, Jan. 2003.
[11] J. Zan, M. O. Ahmad, and M. N. S. Swamy, "New techniques for multi-resolution motion estimation,"
IEEE Trans. Circuits Syst. Video Technol., Vol. 12, No. 9, Sept. 2002.
[12] http://www.geoffdavis.net/
[13] J. Ebert and A. Willig, "A Gilbert-Elliot bit error model and the efficient use in packet level
simulation," Technical Report TKN-99-002, Technical University of Berlin, March 1999.
[14] H. S. Kim and H. W. Park, "Wavelet-based moving-picture coding using shift-invariant motion
estimation in wavelet domain," Signal Processing: Image Communication, Vol. 16, pp. 669-679,
April 2001.
Session 4: Language and Computation Model
Problems with using MPI 1.1 and 2.0 as compilation targets for parallel language implementations
Dan Bonachea and Jason Duell
Computer Science Division
University of California, Berkeley
Berkeley, California, USA
E-mail: [email protected], [email protected]
Keywords: MPI one-sided, parallel languages, global address space, RMA, one-sided communication.
Abstract
MPI support is nearly ubiquitous on high performance systems today, and is generally highly tuned for performance. It would thus seem to offer a convenient "portable network assembly language" to developers of parallel programming languages who wish to target different network architectures. Unfortunately, neither the traditional MPI 1.1 API nor the newer MPI 2.0 extensions for one-sided communication provide an adequate compilation target for global address space languages, and this is likely to be the case for many other parallel languages as well. Simulating one-sided communication under the MPI 1.1 API is too expensive, while the MPI 2.0 one-sided API imposes a number of significant restrictions on memory access patterns that would need to be incorporated at the language level, as a compiler cannot effectively hide them given current conflict and alias detection algorithms.
1 Introduction
Global address-space (GAS) languages, such as Unified Parallel C (UPC) [10], Titanium [31], and Co-Array Fortran [27], are an emerging class of languages that seek to give parallel application developers an alternative to the traditional message passing model. By providing the illusion of a shared address space to a distributed program (regardless of the underlying hardware), along with programmatic awareness and control of memory layout across processors, they promise to combine the performance of message passing systems with the convenience of shared memory programming.
As researchers involved in developing portable implementations of two of these languages (Berkeley UPC [7] and Titanium [31]), we have invested (and continue to invest) substantial amounts of effort into porting our software to new network architectures and programming interfaces.
Many other scientists and hardware vendors, upon learning this, have asked us why we do not simply write our language on top of MPI, to avoid this porting effort. As the most common parallel programming interface on contemporary supercomputing platforms, MPI has typically been highly tuned for performance. It has also been described by some of its developers as providing an "assembly language for parallel processing" [13]. Presumably, then, it ought to provide an efficient, portable target for parallel language compilers.
Unfortunately for GAS language implementors, this is not the case. This paper explains why both the traditional MPI 1.1 two-sided interface [25] and the newer MPI 2.0 [24] one-sided interface (hereafter referred to as MPI-RMA) are inadequate compilation targets for global address space languages. The two-sided communication model inherent in MPI 1.1 cannot support the one-sided communication requirements of GAS languages as efficiently as lower-level APIs. The newer one-sided MPI-RMA presents an interface that, while undoubtedly of great use to many application developers, imposes a number of restrictions on memory access patterns that would require conflict and alias analysis beyond the reach of current compilers. The difficulties presented by MPI-RMA as a compilation target are not specific to GAS languages, and are likely to present an obstacle to any parallel programming language which does not expose application developers to MPI-RMA's numerous restrictions on memory usage patterns.
1.1 Global Address Space languages and their communication requirements
The common characteristic of all GAS languages is that they provide programmers with a shared memory abstraction between all processors in an application (regardless of
whether shared memory is actually supported natively in hardware), and that the programmer is given both knowledge and control of the layout of shared memory across processors. The shared memory model of GAS languages provides the programmer with a familiar interface in which any piece of shared memory can be accessed by any processor in an application via the normal data access mechanisms built into the language, i.e., by accessing an element of a shared array, dereferencing a pointer to shared data, or simply referencing the name of a shared variable. Of course, on many platforms the access time for touching shared memory will depend heavily on the location of the data, and so GAS languages make the layout of shared memory explicit to the programmer, who is encouraged to write code that takes advantage of locality in order to achieve higher performance.
This combination of a globally shared address space with locality information allows an incremental development model, in which serial or shared-memory codes can initially be naïvely ported to a distributed environment, profiled to isolate bottlenecks, and then gradually tuned to optimize performance. The compiler for a GAS language may also be able to hide some or all of the latency cost of remote accesses by scheduling unrelated computation (or more communication) during the interim imposed by network traffic.
GAS languages have some differences in how they allow shared memory to be allocated, and/or how they distinguish shared and non-shared data. UPC and Co-Array Fortran require the programmer to explicitly declare data objects to be either shared or private (usage varies by application, but typically the major data structures reside in shared space). In Titanium, by contrast, all heap and static data can potentially be accessed remotely [14], although a sophisticated compiler escape analysis [19] can detect which objects are potentially "shared" (interesting applications typically have about 50-100% of the total bytes allocated judged to be shared by the analysis). Dynamic allocation of shared memory is required to be collective in Co-Array Fortran, while it can also be achieved in UPC and Titanium through purely local, non-collective operations. UPC additionally allows shared objects to be allocated on remote processes using a non-collective operation (upc_global_alloc()) with no explicit cooperation from the process allocating the data.
Remote data access in GAS languages naturally leads to a one-sided communication pattern. In part, this is a matter of providing a familiar and convenient interface to programmers who are used to a shared-memory programming model, but there is also an important class of dynamic and irregular applications that we wish to support which have communication patterns that are data-dependent and therefore statically unpredictable, and hence are most naturally expressed using one-sided operations.
In GAS languages, the ability to predict data access patterns is further limited by the fact that all three of these GAS languages allow accesses to shared objects with affinity to the calling thread via “local” pointers, and these accesses are indistinguishable from accesses to purely local (private) objects, i.e., the languages’ semantics provide no explicit information about whether the memory being accessed is potentially shared or not. In each language, accesses through “local” pointers to shared data which reside locally generally provide significantly better performance than access through “global” pointers. In fact, the performance impact is so dramatic that most UPC programmers specifically optimize for this case, and the Titanium compiler includes a specialized analysis [18] to automatically infer when such a transformation is provably legal.
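The local-pointer optimization described above is commonly written in UPC as an explicit cast; the following sketch shows the idiom (the array name and function are illustrative only):

```c
#include <upc.h>

shared int data[100*THREADS];   /* default blocksize: cyclic distribution */

int local_sum(void) {
    /* Cast away the shared qualifier for the elements with affinity to
       this thread. The resulting ordinary C pointer bypasses the
       global-pointer representation entirely, so the compiler cannot
       distinguish these loads from accesses to purely private memory. */
    int *lp = (int *)&data[MYTHREAD];  /* legal only because affinity is local */
    int sum = 0;
    for (int i = 0; i < 100; i++)
        sum += lp[i];   /* lp[i] is the local copy of data[MYTHREAD + i*THREADS] */
    return sum;
}
```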
A result of these data-dependent access patterns and shared/local aliasing is that compilers for GAS languages generally have no way to know a priori which specific shared memory locations will be accessed remotely, or when they will be accessed. They thus have no way to statically check, for instance, whether data accesses from different processors will conflict at runtime. This is exactly analogous to the problem of static alias analysis in sequential, pointer-based languages, but is complicated by the existence of multiple independent threads of control – for in-depth discussion, see [16, 28, 23].
Given these properties of GAS languages, they present the following requirements for any underlying network interface:
• The ability to perform (or at least simulate) one-sided communication (i.e., where only the initiator is explicitly involved and the communication proceeds independently of any action taken by the remote threads). While one-sided communication can be simulated by the language runtime using an underlying two-sided communication API, neither the user nor the compiler may be exposed to any aspect of implementing message receipts (or other bookkeeping) on the remote side of an access.
• Good latency performance for small remote accesses. Small messages are commonly used in initial implementations of many GAS programs, and may be unavoidable even in well-tuned applications in certain problem domains. The performance of applications written in global-address-space languages is thus often very sensitive to network round-trip latency.
• The ability for the compiler to hide network latencies by overlapping communication with computation and/or other communication, through the use of non-blocking remote accesses. This requires that the software overhead involved in network traffic be less than the latency of messages (and/or the gap between how often the network interface will accept overlapping messages), as overlapping is impossible if the host CPU is busy throughout a messaging operation.
• Support for arbitrary access patterns to shared data. This includes allowing concurrent remote accesses to the same regions of shared memory from different remote processors, as well as support for allowing shared data to be accessed at arbitrary points in the program via local pointers.
• Support for using or implementing collective communication and synchronization operations.
The rest of this paper shows that neither the MPI 1.1 interface nor the newer MPI-RMA interface supports all of these characteristics adequately.
2 MPI 1.1 as a compilation target for GAS languages
The most widely available and portable software interface for programming distributed-memory machines today is that provided by the MPI 1.1 specification, which has been implemented and carefully tuned on most contemporary high-performance parallel systems. As such, it presents a very tempting compilation target for GAS languages, which could potentially run on all such systems by simply providing a single networking layer that runs on top of MPI.
Communication under MPI 1.1 is strictly two-sided: all traffic takes the form of message sends, which require matching receive operations to be explicitly issued on the receiving side. While this does not neatly match the GAS model of one-sided communication, it does not disqualify MPI 1.1 from use by GAS languages. Our group has implemented an MPI layer [1] which transparently handles message receipt in the runtime by periodically polling for new message receipts, thus providing the illusion of a one-sided interface to both the user and the compiler. We have used this MPI layer on a number of platforms, including the IBM SP, the Cray T3E, the SGI Origin 2000, Linux and Compaq AlphaServer systems using Quadrics network hardware, and Linux clusters using Myrinet, Dolphin/SCI, and Ethernet networks. This portable MPI layer has proven to be an invaluable prototyping tool for quickly deploying our systems on new architectures.
However, as Figure 1 shows, use of MPI 1.1 implementations does not come without a cost. This microbenchmark data, gathered by our group (and described in detail in [4]), demonstrates that the latency and/or software overhead associated with small message sends in MPI is typically much higher than when using native network APIs. This difference is most pronounced on the lowest-latency
systems, which are the most likely targets for GAS applications, with more than a factor of five difference between MPI performance and that of some native machine APIs.
It should be noted that the MPI data in Figure 1 were gathered using nonblocking send and receive calls, and that these calls tend to incur higher overhead than their blocking counterparts in most MPI implementations [4]. It is unclear whether these nonblocking overheads are unavoidable, or are simply the result of less tuning by vendors: most MPI applications use blocking functions, and vendor benchmarks generally report only numbers for blocking send/recv calls. Use of at least some nonblocking calls is necessary when simulating one-sided communication with a two-sided API, in order to prevent deadlock (which could otherwise be caused trivially by two processors sending a message to each other at the same time). Nonblocking calls are also desirable in a GAS language context, as compilers may be able to overlap computation (and/or other network calls) within the interval where the CPU would otherwise be blocked.
To more easily support porting our UPC and Titanium implementations, we use a portable, high-performance networking library called GASNet [5]. This is the layer which provides the one-sided messaging abstraction over MPI 1.1. Besides MPI 1.1, GASNet has also been implemented directly over the native APIs of a number of high-performance networks (the IBM SP’s LAPI [17], Myrinet’s GM [12], Quadrics’ elan [11], and Infiniband [15]). UPC and Titanium applications can switch the underlying network used with a simple recompilation, and this gives us the opportunity to directly compare the performance of a single application run using MPI 1.1 for communication with its timing when run over a lower-level network API. Figures 2, 3, and 5 show the difference in performance between a set of UPC applications running on a Compaq AlphaServer system with a Quadrics interconnect: a naïve and a bulk-synchronous version of the NAS Parallel Benchmarks [2] Conjugate Gradient, and a bulk-synchronous implementation of the NAS Multigrid benchmark. Figures 4 and 6 show the same bulk CG and MG codes running on an x86-Linux Myrinet cluster. Finally, Figure 7 shows the performance of the MG benchmark on an IBM SP Power3. The applications were compiled with the Berkeley UPC Compiler v1.0beta [7] to use either MPI 1.1 or the lower-level vendor-specific networking layer for communication. For comparison purposes, on the Compaq system we also show the performance for the same applications when compiled with the 2.1-003 version of the Compaq UPC compiler [8], which compiles executables to use the vendor-specific elan API (data for Compaq UPC is omitted for higher numbers of nodes due to a recently-discovered performance bug which is still pending). In all cases, the network and system hardware being used is identical – the only difference is the runtime system and communication software in use.

Figure 1. Send and receive software overheads (os and or) superimposed on the one-way end-to-end latency (EEL) for 8-byte messages on various high-performance systems. For MPI on the T3E and Myrinet, the sum of the overheads is greater than EEL, and so os = S + V and or = R + V. For the other configurations, os = S and or = R.

Figure 2. Performance for a naïve (fine-grained) UPC implementation of NAS Conjugate Gradient using different network APIs/compilers on a Compaq AlphaServer.

Figure 3. Performance for a bulk-synchronous UPC implementation of NAS Conjugate Gradient using different network APIs/compilers on a Compaq AlphaServer.

Figure 4. Performance for a bulk-synchronous UPC implementation of NAS Conjugate Gradient using different network APIs with the Berkeley UPC compiler over Myrinet/GM.

Figure 5. Performance for a bulk-synchronous UPC implementation of NAS Multigrid using different network APIs/compilers on a Compaq AlphaServer.

Figure 6. Performance for a bulk-synchronous UPC implementation of NAS Multigrid using different network APIs with the Berkeley UPC compiler over Myrinet/GM.

Figure 7. Performance for a bulk-synchronous UPC implementation of NAS Multigrid using different network APIs with the Berkeley UPC compiler on an IBM SP.
The results show that the MPI 1.1-based GASNet layer is significantly outperformed by those based directly over one-sided networking APIs. The communication costs of the bulk-synchronous CG code are dominated by bulk put operations (avg. 56KB each) and some small (8-byte) gets, whereas the MG code is dominated by bulk get operations (avg. 100KB each) and some small (8-byte) gets. For the two bulk-synchronous benchmarks, the performance gap is less a function of any inherent difference between the MPI bandwidth and that of the lower-level API (the two tend to be similar on each system) than of the cost of providing the illusion of one-sided communication under MPI 1.1.
Currently, our GASNet-MPI implementation works by having each host post a set of nonblocking receive calls in advance, with varying buffer sizes, so that a buffer of the correct size is generally immediately available for any incoming message. This allows a remote processor to complete a message send before the target processor may even be aware that the message has arrived. This buffering strategy requires that the message be copied twice (once on the source node, to add a header specifying the destination address, and once at the target node, to move the data from the buffer to the final destination).
The GASNet-MPI implementation could probably provide better performance for large messages if it used the initial receive buffer to inform the target CPU to set up a receive call that placed the data directly into the destination memory, thus avoiding buffering. However, this would still be inherently slower than the elan-based direct RDMA version: the transaction would incur an additional rendezvous (in the form of a set of MPI messages) to set up the direct receive call (and likely an additional network round trip as MPI did its own internal rendezvous), followed by the direct RMA transfer, and then a final MPI message from the destination processor informing the initiator that the message had completed (MPI does not provide built-in notification of remote completion, but it is required for GAS language semantics, which sometimes need to guarantee target completion for remote writes). The transaction would also need to wait initially for the remote processor to poll the network to see the initial setup message. For messages of a sufficiently large size, however, these costs ought to be amortized to the point where the MPI layer’s peak bandwidth performance might asymptotically approach that of the elan layer (when the remote node is well-attentive to the network). It is unclear what the crossover message size would be for this sort of algorithm, however, and it would need to be tuned for each different type of network (and possibly on a per-machine basis), thus lessening the portability benefit of using MPI.
The performance of the naïve CG benchmark (which uses an average message size of only 8 bytes) is less ambiguous. For applications which use small messages, elan provides a significant speedup, allowing the application to run more than four times faster. Clearly the communication performance obtained by directly targeting the native
network API is more suitable than a solution layered on MPI for supporting an incremental programming model for parallel applications and for inherently fine-grained applications.
It is unclear if MPI 1.1 will ever be able to provide GAS languages with the same performance as native network APIs for small and medium-sized one-sided messages. The popularity of MPI has meant that vendors are willing to expend considerable effort tuning their implementations, and certain trends are promising, such as the offloading of some of MPI’s message-passing logic onto dedicated network co-processors. Although such co-processors are not likely to improve end-to-end message latency, they might lower the host CPU’s software overheads to levels comparable to those of lower-level network APIs, which might in turn allow a UPC implementation to hide the latency by performing unrelated work in the interim.
3 MPI-RMA as a compilation target for GAS languages
The most recent version of the MPI specification (MPI 2.0) extends MPI 1.1 with a number of new features. In particular, Chapter 6 of the specification [24] adds support for “one-sided communication” (also called “Remote Memory Access” (RMA)). This extension supplements the traditional two-sided MPI communication model with a one-sided interface that can take advantage of the capabilities of RMA network hardware, in the hopes of lowering latency and software overhead costs for applications written using a shared-memory-like paradigm.
On the surface, MPI-RMA thus seems like a natural fit for GAS languages. However, the rest of this document explains why MPI-RMA unfortunately does not meet the requirements laid out in Section 1.1 for implementing global address space languages. Essentially, the strong restrictions placed on memory access patterns by the API and the weakness of its semantic guarantees make it unusable for GAS language implementation purposes. It is our sincere hope that the observations presented here regarding the problematic semantic aspects of the interface can serve as a guide during future revisions of the MPI-RMA specification – in Section 5 we provide some initial ideas for improving the semantics to better meet the needs of the GAS language community.
It is important to note that we are interested in writing portable code, and are therefore not concerned with the behavior of particular implementations of MPI-RMA (which may happen to relax some of these constraints, and/or have well-defined semantics for conditions the specification labels as erroneous), but rather with the guarantees provided by any MPI-RMA implementation (including one which aggressively exploits the intentionally under-specified aspects of the API).
3.1 Basics of the MPI-RMA API
The semantics of MPI-RMA are fairly complex. Here we present only an overview of its usage, then proceed to discuss those aspects of the API that affect GAS language implementations. The reader may consult the MPI 2.0 specification [24] for full details – the subsequent discussion includes references to the sections in the specification relevant to each major semantic issue.
The MPI-RMA API revolves around the use of abstract objects called “windows”, which intuitively specify regions of a process’s memory that have been made available for remote operations by other MPI processes. Windows are created using a collective operation (MPI_Win_create) called by all processes within a “communicator” (a group of processes in MPI terminology), which specifies a base address and length (which may be different on each process), and is permitted to span very large areas (e.g., the entire virtual address space). All three one-sided RMA operations (MPI_Put, MPI_Get, MPI_Accumulate) take a reference to such a window, an offset into the window, and a rank integer to indicate which process is the remote target. All one-sided operations are implicitly non-blocking and must be synchronized using one of the synchronization methods described below.
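The basic call shapes can be sketched as follows; this uses the standard MPI-2 API (here in its active-target fence form, described in the next section), with error handling and MPI_Init/MPI_Finalize context omitted:

```c
#include <mpi.h>

double buf[1024];   /* the memory this process exposes for RMA */
MPI_Win win;

void rma_example(int target_rank) {
    /* Collective window creation across MPI_COMM_WORLD. */
    MPI_Win_create(buf, sizeof buf, sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);               /* open a synchronization epoch */
    double x = 3.14;
    MPI_Put(&x, 1, MPI_DOUBLE, target_rank,
            0 /* displacement into target's window */,
            1, MPI_DOUBLE, win);         /* implicitly non-blocking */
    MPI_Win_fence(0, win);               /* the put is complete only here */

    MPI_Win_free(&win);
}
```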
3.1.1 Active Target vs. Passive Target
There are two primary “modes” in which the one-sided API can be used, named “active target” and “passive target”. The primary semantic distinction is whether or not cooperation is required from the remote node in order to complete a remote memory access. All RMA operations on a window must take place within a synchronization “epoch” (with a start and end point defined by explicit synchronization calls), and operations are not guaranteed to be complete until the end of such an epoch. The active and passive target modes differ in which process makes these synchronization calls.
Active target operation requires synchronization functions to be called on both the origin process (the one making the RMA get/put accesses) and the target process (the one hosting the memory in the referenced window). The origin process calls MPI_Win_start/MPI_Win_complete to begin/end the synchronization epoch, and the target process must cooperate by calling MPI_Win_post/MPI_Win_wait to acknowledge the beginning/end of the epoch (there is also a collective MPI_Win_fence operation which can be substituted for one or more of these calls). In any case, this required cooperation effectively destroys the possibility of implementing the truly one-sided operations that we wish to provide in GAS languages using active-target mode RMA.
Passive target operation provides more lenient synchronization. In passive-target operation, only the originating process calls synchronization functions (MPI_Win_lock/MPI_Win_unlock) to start/end the access epoch. As with active target, all RMA accesses must take place within such an epoch and are not guaranteed to complete until the MPI_Win_unlock call completes. There are two forms of MPI_Win_lock: shared and exclusive. MPI_Win_lock(exclusive) enforces mutual exclusion on the window and the RMA operations performed within the epoch – i.e., it conceptually blocks until it can start an exclusive access epoch to the window, and no other processes may enter a shared or exclusive access epoch for that window until the process with exclusive access unlocks (the semantics are actually slightly weaker than this, but the intuition is correct). MPI_Win_lock(shared) allows other concurrent shared epochs from other processes. The specification recommends the use of exclusive epochs when executing any local or RMA update operations on the memory encompassed by the window, to ensure well-defined semantics (section 6.4.3, p. 131).
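A minimal passive-target epoch looks like the following sketch (standard MPI-2 calls; the function name and the assumption that the window has already been created are ours):

```c
#include <mpi.h>

/* Only the origin makes synchronization calls; the put is not
   guaranteed to be complete until MPI_Win_unlock returns. */
void one_sided_put(MPI_Win win, int target_rank, double value) {
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0 /* assert */, win);
    MPI_Put(&value, 1, MPI_DOUBLE, target_rank,
            0 /* displacement in the target's window */,
            1, MPI_DOUBLE, win);
    MPI_Win_unlock(target_rank, win);   /* closes the access epoch */
}
```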
3.2 Restrictions on the Use of Passive Target RMA
The interface described thus far for passive-target RMA seems reasonable; unfortunately, however, there are a large number of restrictions on how it may legally be used. Here are some of the most important restrictions:
1. Window creation is a collective operation – all processes which intend to use a window for RMA (including all intended origin and target processes) must participate in the creation of that window (section 6.2.1, p. 110).
2. Implementors may restrict the use of passive-target RMA operations to only work on memory allocated using the “special” memory allocator MPI_Alloc_mem (section 6.4.3, p. 131). This prevents the use of passive-target RMA on static data and forces all globally-visible objects to be allocated using this “special” allocation call (no guarantees are made about how much memory can be allocated using this call, and some implementations may restrict it to a small number of pinnable pages).
3. It is erroneous to have concurrent conflicting RMA get/put (or local load/store) accesses to the same memory location (section 6.3, p. 113).
4. The memory spanned by a window may not concurrently be updated by a remote RMA operation and a local store operation (i.e., within a single access epoch), even if these two updates access different (i.e., non-overlapping) locations in the window (section 6.3, p. 113).
5. Multiple windows are permitted to include overlapping memory regions; however, it is erroneous to use concurrent operations on distinct overlapping windows (section 6.2.1, p. 111).
6. RMA operations on a given window are only permitted to access the memory of a single process during an access epoch (section 6.4.3, p. 131).
3.3 Implications of the MPI 2.0 semantics
We now investigate the implications of the above restrictions on our effort to implement remote accesses in a GAS language.
Restriction #1 implies that a GAS language implementation cannot use a separate window per shared object, because that would require a collective operation for the allocation of shared objects, and shared object allocation in GAS languages is often required to be a purely non-collective, local operation. In order to support non-collective allocation, many or all shared objects would need to be coalesced within a single window. But while arbitrarily large regions (like the entire virtual address space) can be mapped within a single window, this approach would severely limit concurrency due to restriction #4. The same shared object can be mapped into several windows, but due to restriction #5 this approach is probably not useful.
Restriction #2 implies that all potentially shared objects must be allocated using MPI_Alloc_mem(). This restriction alone may be enough to make MPI-RMA unsuitable for many GAS languages, unless MPI implementations permit large amounts of memory to be allocated using this function (recall that a significant fraction of all data allocated by GAS programs is typically accessed remotely at some point in program execution).
Restriction #6 means that an implementation would probably need at least one separate window for each target process, as otherwise RMA operations from one process to different target processes would be unnecessarily serialized. This may present scalability problems for applications using a large number of nodes, depending on how windows are implemented.
The underlying concept addressed by restriction #3 is fundamental to shared-memory programming and nothing new. GAS languages generally specify that conflicting accesses to a single memory location will store an undefined result to the location (for conflicting writes) or return an undefined value (for the read in a read-write conflict). However, the MPI restriction is unfortunately much stronger – it says such conflicting accesses are erroneous, which implies that any resulting behavior after such a violation is possible (e.g., the MPI implementation would be within its rights to consider this a fatal error). Unfortunately, it is not feasible to statically detect all such conflicting accesses (from different processes) in user-provided code without application-specific information (or perhaps even with it). Using a great deal of compiler analysis, a conservative superset of the conflicting accesses could be generated, but in a weakly typed language such as UPC, this is likely to include most of the accesses. It is impossible to detect such conflicts at runtime without global communication. In truth, the compiler analysis required simply to decide that it is safe for a single process to include any other RMA accesses within the same epoch as an RMA put or accumulate is non-trivial, although this problem is an issue of sequential alias analysis, which is considerably better understood. MPI_Win_lock(exclusive) can be used to conservatively prevent concurrent conflicting accesses from distinct processes (by wrapping every RMA put within its own exclusive epoch). However, because this locks the entire window and serializes all access epochs to that window, it would drastically reduce the concurrency of accesses to distinct (i.e., non-conflicting) memory locations that happen to reside within the same window (which would be very bad if the entire shared memory resided in a single window). This also effectively nullifies the ability to perform non-blocking puts. Another option which might help is to replace all RMA puts with RMA accumulate operations where the reduction operation is MPI_REPLACE – this has the same semantics as an RMA put, but conflicting accumulate operations have well-defined semantics (they behave as if the conflicting accumulates happened in some serial order). However, conflicting RMA gets and local loads to the same data would still be erroneous within the access epoch, so an untenable amount of conflict analysis would still be required.
Restriction #4 is particularly onerous for GAS language implementations. It prevents the local process from making any changes to memory that lies within a window during a remote access epoch to that window, even to different (i.e., non-conflicting) memory locations. This implies some form of synchronization between the origin and target process when accessing these locations, which implies that truly one-sided communication is not possible. The MPI spec recommends that the local process perform all updates to local memory which falls within a window inside an exclusive epoch on that window. Because every access to a language-level local pointer in UPC and Titanium is potentially an access to a shared memory location residing locally, in the absence of other information every local store operation would need to be wrapped within an exclusive access epoch (possibly prefixed with a check of whether the accessed location resides in a window). Similarly, every local load which could potentially conflict with a remote RMA put or accumulate to that location would need to be wrapped within a shared access epoch. The performance implications of adding such overheads to what should be simple local memory load/store instructions are staggering. Note that it is not sufficient for the local process to simply
maintain a permanent epoch on its own window, because this would prevent remote RMA operations on that window from making progress. Finally, restriction #4 also implies that the entire virtual address space of a process cannot be included in a window, because this would include private memory that is constantly changing, such as the program stack.
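To make the cost concrete, the following sketch shows what restriction #4 would, in the worst case, make of a single local store to memory that happens to fall inside a window (the function name and parameters are ours; the calls are standard MPI-2):

```c
#include <mpi.h>

/* Every local store to window-covered memory would need an exclusive
   epoch on this process's own portion of the window around it --
   turning one store instruction into two synchronization calls. */
void guarded_local_store(MPI_Win win, int my_rank, int *local_ptr, int v) {
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, my_rank, 0, win);
    *local_ptr = v;                /* the actual store */
    MPI_Win_unlock(my_rank, win);  /* serializes against remote RMA */
}
```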
The combination of these effects makes MPI-RMA effectively unusable by GAS language implementations: it is extremely unlikely that a layer could be written on top of MPI-RMA that would provide an efficient implementation of GAS language communication semantics. Given that adherence to some of the MPI-RMA restrictions involves such difficult compiler problems as conflict and alias analysis, it is likely that many other parallel languages would also find MPI-RMA a difficult interface to target. It seems likely that only languages that effectively expose most of MPI-RMA’s restrictions and programming disciplines to the user at the language level would find MPI-RMA a useful compilation target – however, users are unlikely to tolerate such rigid usability restrictions in any parallel language that exposes a shared-memory-like paradigm.
4 Related Work
A number of papers on MPI-RMA performance are available. Luecke and Hu [20] evaluated MPI-RMA performance on the Cray SV1, and Luecke et al. [21] compared MPI-RMA performance to that of the vendor-supplied SHMEM libraries on the Cray T3E and SGI Origin 2000 architectures, finding that the SHMEM interface significantly outperformed the MPI-RMA implementation available at the time. Traff et al. [30] examined the performance of MPI-RMA on an NEC SX-5 system, and found its data transfer performance to be similar to that of the MPI 1.0 two-sided API. Matthey and Hansen [22] report that a production molecular dynamics simulation code ran 10-70% faster on an Origin 2000 system after being converted from MPI-1 to MPI-RMA.
Our strategy of implementing one-sided, asynchronous messaging on top of MPI 1.1 is not a new one. Dobbelaere [9] reports implementing a custom one-sided communication layer over MPI 1, and Booth and Mourao [6] discuss implementing the MPI-RMA API itself over MPI 1.1.
It appears from Smith [29] that the MPI-RMA pioneers decided early on to require a separate set of allocation functions to allocate data that can be accessed by passive-target one-sided operations. For an alternative approach that allows the entire virtual memory space to be accessed by RDMA (even for the tricky case of pinning-based networks such as Myrinet), see the description of lazy registration/pinning methods in [3].
Gropp [13] reviews some of the reasons for MPI’s enduring success in the parallel computing community.
5 Conclusion
We have shown that despite having certain desirable characteristics, neither the MPI 1.1 interface nor the more recent MPI 2.0 RMA extensions make an attractive compilation target for Global Address Space languages. MPI 1.1, despite its wide availability and often highly-tuned performance, imposes too many overheads with its two-sided messaging paradigm for GAS languages, particularly those in which small-message performance is important. The newer MPI-RMA API imposes too many semantic restrictions to be a useful portable compilation target, at least for parallel languages which allow aliasing, data conflicts, and/or the illusion of a single, arbitrarily accessible shared address space.
The fact that MPI-RMA makes a poor compilation target for at least certain classes of parallel languages does not mean that the API is not useful to application programmers, who are capable of globally structuring their applications to abide by the specification’s semantic restrictions. The purpose of this paper, in short, is not to malign MPI-RMA, but rather to explain why it is not suitable as a portable communication layer for implementing parallel languages such as Titanium, UPC, and Co-Array Fortran.
We sincerely hope that the MPI-RMA interface can be revised in future versions of the MPI specification to relax some of the problematic semantic restrictions described in this paper. Perhaps with some careful changes made to accommodate the requirements of parallel high-performance language implementors, it would become more useful as a portable compilation target for these languages. One possible approach is to introduce a third mode of operation in MPI-RMA, complementing active-target and passive-target modes, that discards the current problematic semantic restrictions and provides a high-performance RMA interface similar to the one in GASNet [5] or ARMCI [26]. Our experience with these layers indicates such an interface could be implemented efficiently on a wide range of communication networks and would successfully expose the high-performance capabilities of the network hardware without crippling the client with unreasonable semantic restrictions. Any remaining portability concerns could easily be addressed by making the new interface an optional component that need not be provided by all MPI implementations. We believe the inclusion of such an option within a future revision of the MPI specification would largely solve the portability issues in implementing GAS languages and encourage communication innovation by allowing network vendors to more usefully expose the one-sided RMA capabilities of their hardware.
References
[1] AMMPI home page. http://www.cs.berkeley.edu/∼bonachea/ammpi.
[2] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS parallel benchmarks. Tech Report RNR-94-007, RNR, March 1994. http://www.nas.nasa.gov/Software/NPB/.

[3] C. Bell and D. Bonachea. A new DMA registration strategy for pinning-based high performance networks. In Workshop on Communication Architecture for Clusters (CAC03) of IPDPS '03, Nice, France, 2003.

[4] C. Bell, D. Bonachea, Y. Cote, J. Duell, P. Hargrove, P. Husbands, C. Iancu, M. Welcome, and K. Yelick. An evaluation of current high-performance networks. In IPDPS 2003, 2003. http://upc.lbl.gov.

[5] D. Bonachea. GASNet specification, v1.1. Tech Report UCB/CSD-02-1207, U.C. Berkeley, October 2002. http://www.cs.berkeley.edu/∼bonachea/gasnet.

[6] S. Booth and F. E. Mourao. Single sided MPI implementations for SUN MPI. In Supercomputing, 2000.

[7] W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, and K. Yelick. A performance analysis of the Berkeley UPC compiler. In 17th Annual International Conference on Supercomputing (ICS), 2003. http://upc.lbl.gov.

[8] Compaq UPC compiler. http://www.tru64unix.compaq.com/upc/.

[9] J. Dobbelaere and N. Chrisochoides. One-sided communication over MPI-1. http://citeseer.nj.nec.com/dobbelaere01onesided.html.

[10] T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper. UPC specification, v1.1, March 2003. http://upc.gwu.edu.

[11] Elan programmer's manual. http://www.quadrics.com.

[12] GM reference manual. http://www.myri.com/scs/GM/doc/refman.pdf.

[13] W. D. Gropp. Learning from the success of MPI. In International Conference on High Performance Computing (HiPC 2001), August 2001.

[14] P. Hilfinger, D. Bonachea, D. Gay, S. Graham, B. Liblit, G. Pike, and K. Yelick. Titanium language reference manual. Tech Report UCB/CSD-01-1163, U.C. Berkeley, November 2001.

[15] Infiniband trade association home page. http://www.infinibandta.org.

[16] A. Krishnamurthy and K. Yelick. Analyses and optimizations for shared address space programs. Journal of Parallel and Distributed Computing, 38(2):130–144, 1996.

[17] Using the LAPI. http://www.research.ibm.com/actc/OptLib/LAPI_Using.htm.

[18] B. Liblit and A. Aiken. Type systems for distributed data structures. In Conference Record of POPL '00: The 27th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Boston, Massachusetts, 19–21 Jan. 2000.

[19] B. Liblit, A. Aiken, and K. Yelick. Type systems for distributed data sharing. In SAS '03: The 10th International Static Analysis Symposium, Lecture Notes in Computer Science, San Diego, California, June 11–13, 2003. Springer-Verlag.
[20] G. R. Luecke and W. Hu. Evaluating the performance of MPI-2 one-sided routines on a Cray SV1. Technical report, December 21, 2002.

[21] G. R. Luecke, S. Spanoyannis, and M. Kraeva. The performance and scalability of SHMEM and MPI-2 one-sided routines on a SGI Origin 2000 and a Cray T3E-600. In PEMCS, December 2002.

[22] T. Matthey and J. Hansen. Evaluation of MPI's one-sided communication mechanism for short-range molecular dynamics on the Origin 2000. In Proceedings of PARA2000, the Fifth International Workshop on Applied Parallel Computing, 2000.

[23] S. P. Midkiff and D. A. Padua. Issues in the optimization of parallel programs. In Proceedings of the 1990 International Conference on Parallel Processing, 1990.

[24] MPI Forum. MPI-2: a message-passing interface standard. International Journal of High Performance Computing Applications, 12:1–299, 1998. http://www.mpi-forum.org/docs/mpi-20.ps.

[25] MPI Forum. MPI: A message-passing interface standard, v1.1. Technical report, University of Tennessee, Knoxville, June 12, 1995. http://www.mpi-forum.org/docs/mpi-11.ps.

[26] J. Nieplocha and B. Carpenter. ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. In Proc. RTSPP IPPS/SPDP '99, 1999.

[27] R. Numrich and J. Reid. Co-Array Fortran for parallel programming. ACM Fortran Forum, 17(2):1–31, 1998.

[28] D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Transactions on Programming Languages and Systems, 10(2):282–312, April 1988.

[29] A. G. Smith. Using MPI 2 one sided communications on Cray T3D. Tech report, EPCC, The University of Edinburgh, December 1995. http://citeseer.nj.nec.com/88176.html.

[30] J. Traff, H. Ritzdorf, and R. Hempel. The implementation of MPI-2 one-sided communication for the NEC SX. In Proceedings of Supercomputing, 2000.

[31] K. Yelick, L. Semenzato, G. Pike, et al. Titanium: A high-performance Java dialect. In ACM 1998 Workshop on Java for High-Performance Network Computing, February 1998. http://titanium.cs.berkeley.edu.
Fast Parallel Solution For Set-Packing and Clique Problems by DNA-based Computing
Michael (Shan-Hui) Ho 1, Weng-Long Chang2, Minyi Guo3,
and Laurence Tianruo Yang4
1Department of Information Management Southern Taiwan University of Technology, Tainan County, Taiwan 710, R.O.C.
Phone Number: +886-6-2533131 Ext. 4300 E-mail: [email protected], [email protected]
Fax Number: +886−6−2541621
2Department of Information Management Southern Taiwan University of Technology, Tainan County, Taiwan 710, R.O.C.
Phone Number: +886-6-2533131 Ext. 4300 E-mail: [email protected], [email protected]
Fax Number: +886−6−2541621
Contact Author: Minyi Guo 3Department of Computer Software,
The University of Aizu, Aizu-Wakamatsu City, Fukushima 965-8580, Japan Phone Number: 0242-37-2557
E-mail: [email protected] Fax Number: 0242-37-2744
4Department of Computer Science,
St. Francis Xavier University P.O. Box 5000, Antigonish, B2G 2W5, NS, Canada
E-mail: [email protected]
Abstract
This paper shows how to use stickers to construct the DNA solution space for the library sequences in the set-packing problem and the clique problem. Then, with biological operations, we propose DNA-based algorithms to remove illegal solutions and to find legal solutions for the set-packing and clique problems from the sticker solution space. Any NP-complete problem in Cook's Theorem [28] can be reduced to and solved by the proposed DNA-based computing approach if its size is equal to or less than that of the set-packing problem. Otherwise, Cook's Theorem is incorrect on DNA-based computing and a new DNA algorithm should be developed from the characteristics of the NP-complete problem. Finally, the result of the DNA simulation is given.

Key Words: Molecular Parallel Computing, DNA-based Computing, NP-complete Problem, Cook's Theorem.
1. Introduction
Through advances in molecular biology [1], it is now possible to produce roughly 10^18 DNA strands that fit in a test tube. Those 10^18 DNA strands can also be applied to represent 10^18 bits of information, and basic biological operations can be used to operate on those 10^18 bits simultaneously. In other words, 10^18 data processors can be executed in parallel. Hence, it becomes obvious that biological computing can provide huge parallelism for dealing with problems in the real world.
Adleman wrote the first paper demonstrating that DNA (deoxyribonucleic acid) strands could be applied to figure out solutions of an instance of the NP-complete Hamiltonian path problem [2]. Lipton showed that the Adleman techniques could also be used to solve the NP-complete satisfiability problem [3]. Adleman and his co-authors proposed the sticker model to enhance the Adleman-Lipton model [14].
In this paper, we use stickers to construct a DNA solution space for the library sequences of the set-packing and clique problems. Simultaneously, we also apply DNA operations in the Adleman-Lipton model to develop DNA algorithms. The results of the proposed DNA algorithms show that the set-packing and clique problems are resolved with biological operations in the Adleman-Lipton model on the sticker solution space. Furthermore, if the size of a reduced NP-complete problem is equal to or less than that of the set-packing problem, then the proposed algorithm for the set-packing problem can be used directly to solve the reduced NP-complete problem.
2. DNA Model of Computation
2.1. The Adleman-Lipton Model
It is cited from [1, 16] that DNA (deoxyribonucleic acid) is a polymer strung together from monomers called deoxyribonucleotides. Distinct nucleotides are distinguished by their bases: adenine, guanine, cytosine and thymine, abbreviated as A, G, C and T. It is pointed out in [1, 11, 16] that a DNA strand is essentially a sequence (polymer) of the four types of nucleotides, identified by the bases they contain. Under appropriate conditions two strands of DNA can form a double strand if the respective bases are the Watson-Crick complements of each other – A matches T and C matches G; also the 3' end matches the 5' end. The length of a single-stranded DNA is the number of nucleotides comprising the strand. Thus, if a single-stranded DNA includes 20 nucleotides, we say that it is a 20-mer (a polymer containing 20 monomers). The length of a double-stranded DNA (where each nucleotide is base-paired) is counted in the number of base pairs. Thus if we make a double-stranded DNA from a single-stranded 20-mer, then the length of the double-stranded DNA is 20 base pairs, also written as 20 bp. (For more discussion of the relevant biological background refer to [1, 11, 16].)
In the Adleman-Lipton model [2, 3], splints were used to construct the edges of paths in a particular graph, which represented all possible binary numbers. As it stands, their construction indiscriminately builds all splints that lead to a complete graph, which means hybridization has a higher probability of errors. Hence, Adleman and his co-authors [14] proposed the sticker-based model, an abstract model of molecular computing based on DNA with a random-access memory as well as a new form of encoding the information.
The DNA operations in the Adleman-Lipton model [2, 3, 11, 12] are described below. These operations will be used for calculating solutions of the set-packing and clique problems.
The Adleman-Lipton Model:
A (test) tube is a set of molecules of DNA (i.e. a multi-set of finite strings over the alphabet {A, C, G, T}). Given a tube, one can perform the following operations:
1. Extract. Given a tube P and a short single strand of DNA called S, we can produce two tubes +(P, S) and −(P, S), where +(P, S) is all of the molecules of DNA in P which contain the short strand S, and −(P, S) is all of the molecules of DNA in P which do not contain the short strand S.
2. Merge. Given tubes P1 and P2, yield ∪(P1, P2), where ∪(P1, P2) = P1 ∪ P2.
This operation is to pour two tubes into one, with no change in the individual strands.
3. Detect. Given a tube P, if it includes at least one DNA molecule we can say
‘yes’, and if it contains no DNA we can say ‘no’.
4. Append. Given a tube P and a short strand of DNA called Z, the operation will append the short strand Z onto the end of every strand in tube P.
5. Discard. Given a tube P, the operation will discard the tube P.
6. Read. Given a tube P, the operation is used to describe a single molecule,
which is contained in tube P. Even if P contains many different molecules, the operation can give an explicit description of only one of them.
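The six operations above can be mirrored as multiset manipulations in ordinary software. The following minimal Python model is our own sketch, not part of the Adleman-Lipton formalism: a tube is a list of strings and a "short strand" is a substring.

```python
class Tube:
    """A test tube modeled as a multiset (list) of DNA strings."""

    def __init__(self, strands=None):
        self.strands = list(strands or [])

    def extract(self, s):
        """Extract: split into +(P, s) and -(P, s) by presence of substring s."""
        plus = Tube([x for x in self.strands if s in x])
        minus = Tube([x for x in self.strands if s not in x])
        return plus, minus

    def merge(self, other):
        """Merge: pour two tubes into one, with no change to individual strands."""
        return Tube(self.strands + other.strands)

    def detect(self):
        """Detect: True ('yes') iff the tube contains at least one molecule."""
        return len(self.strands) > 0

    def append(self, z):
        """Append: attach the short strand z onto the end of every strand."""
        self.strands = [x + z for x in self.strands]

    def read(self):
        """Read: describe one molecule contained in the tube (if any)."""
        return self.strands[0] if self.strands else None
```

For instance, extracting on the substring "AT" from a tube holding "GATC" and "GGCC" yields a plus-tube with the first strand and a minus-tube with the second.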
2.2. Other Related Work and Comparison with the Adleman-Lipton Model
Based on the splint solution space in the Adleman-Lipton model, the methods of [7, 17, 18, 19, 20] could be applied toward solving the traveling salesman problem, the dominating-set problem, the vertex cover problem, the clique problem, the independent-set problem, the 3-dimensional matching problem and the set-packing problem. The methods used for resolving these problems require exponentially increasing volumes of DNA and linearly increasing time.
Bach et al. [33] proposed an n·1.89^n volume, O(n² + m²) time molecular algorithm for the 3-coloring problem and a 1.51^n volume, O(n²m²) time molecular algorithm for the independent set problem, where n and m are, respectively, the number of vertices and the number of edges in the problems resolved. Fu [21] presented a polynomial-time algorithm with a 1.497^n volume for the 3-SAT problem, a polynomial-time algorithm with a 1.345^n volume for the 3-coloring problem and a polynomial-time algorithm with a 1.229^n volume for the independent set problem. Though those volumes [21, 33] are smaller, constructing them is more difficult and the time complexity of the methods is higher.
Ouyang et al. [4] showed that enzymes could be used to solve the NP-complete clique problem. Because the maximum number of vertices that they can process is limited to 27, the maximum number of DNA strands for solving the problem is 2^27 [4]. Shin et al. [8] presented an encoding scheme for decreasing the error rate of hybridization. This method [8] can be employed toward the traveling salesman problem, representing integers and real values with fixed-length codes. Arito et al. [5] and Morimoto et al. [6] proposed new molecular experimental techniques and a solid-phase method to find a Hamiltonian path. Amos [13] proposed a parallel filtering model for resolving the Hamiltonian path problem, the subgraph isomorphism problem, the 3-vertex-colorability problem, the clique problem and the independent-set problem. These methods [5, 6, 13] have lowered the error rate in real molecular experiments.
In the literature [26, 27, 30], the methods for DNA-based computing by self-assembly require DNA nanostructures, called tiles, to have efficient chemistries, expressive computational power, and convenient input and output (I/O) mechanisms. DNA tiles have a lower error rate in self-assembly. Garzon et al. [31] provided a review of the most important advances in molecular computing.
Adleman and his co-authors [14] proposed a sticker-based model to reduce the error rate of hybridization in the Adleman-Lipton model. Their model can be used for determining solutions of an instance of the set cover problem. Perez-Jimenez et al. [15] employed the sticker-based model [14] to resolve knapsack problems. In our previous work, Chang et al. [25, 32] also employed the sticker-based model and the Adleman-Lipton model for dealing with the dominating-set problem and the set-splitting problem, decreasing the error rate of hybridization.
3. Using Sticker for Solving the Set-Packing Problem in the Adleman-Lipton Model
3.1. Definition of the Set-Packing Problem
Assume that a finite set S is {s1, …, sq}, where sm is the mth element for 1 ≤ m ≤ q. Also suppose that |S| is the number of elements in S and |S| is equal to q. Assume that C is a collection of subsets of S and C is equal to {C1, …, Cn}, where each Ck is a subset of S for 1 ≤ k ≤ n. Also suppose that |C| is the number of subsets in C and |C| is equal to n. Suppose that the number of subsets in C is greater than or equal to l, where l is a positive number. A set-packing for S is a sub-collection C1 ⊆ C such that C1 contains at least l mutually disjoint sets [9, 10]. The set-packing problem is to find the maximum-size set-packing for S and has been proved to be NP-complete [10].
In Figure 1, a finite set S is {1, 2} and a collection C is {{1}, {2}} for S. The set S and the collection C denote such a problem. The maximum-size set-packing for S is {{1}, {2}}. Hence, the size of the set-packing problem in Figure 1 is two. It is indicated in [10] that finding a maximum-size set-packing is an NP-complete problem, so it can be formulated as a "search" problem.
S = {1, 2} and C = {{1}, {2}}
Figure 1: The finite set and the collection of our problem.
3.2. Using Sticker for Constructing Solution Space of DNA Sequence for the Set-Packing Problem
In the Adleman-Lipton model, the main idea is to first generate the solution space of DNA sequences for the problem to be resolved. Then, basic biological operations are used to remove illegal solutions and to find legal solutions from the solution space. Therefore, the first step in resolving the set-packing problem is to produce a test tube which contains all of the possible set-packings. Assume that an n-digit binary number corresponds to each possible set-packing for any n-subset collection C. Also suppose that C1 is a set-packing for C. If the ith bit in an n-digit binary number is set to 1, then the ith subset is in C1. If the ith bit in an n-digit binary number is set to 0, then the corresponding subset is not in C1.
Assume that q one-digit binary numbers represent the q elements in S, and suppose that the mth one-digit binary number corresponds to the mth element in S. If the kth bit in the n bits is set to 1 and the mth element in Ck is distinct from other elements in other subsets, then the corresponding subset is in C1 and the mth element in Ck is included in C1. In that case the mth one-digit binary number is appended onto the tail of those binary numbers containing the value 1 in the kth bit. If the kth bit in the n bits is set to 0, then every element in Ck is excluded from C1; that is to say, the elements in Ck are not appended onto the tail of those binary numbers containing the value 0 in the kth bit.
To implement this, assume that an n-bit binary number Z is represented by z1, …, zn, where the value of zk is 1 or 0 for 1 ≤ k ≤ n. The bit zk is the kth bit in Z and represents the kth subset in C. Assume that the q one-digit binary numbers are, respectively, y1, …, yq, and that ym represents the mth element in S. An n-bit binary number Z covers all possible 2^n choices of subsets, and each choice of subsets corresponds to a possible set-packing. If the value of zk is set to 1 and every element in the subset Ck is distinct from other elements in other subsets, then the value 1 for every element in Ck is, in turn, appended onto the tail of those binary numbers including the value 1 in the kth bit.
Consider the finite set S and the collection C in Figure 1. The finite set S is {1, 2} and the collection C is {{1}, {2}}. Table 1 denotes the solution space for Figure 1. The binary number 00 in Table 1 represents ∅. The binary numbers 01 and 10 in Table 1 represent {{2}} and {{1}}, respectively. The binary number 11 in Table 1 represents {{1}, {2}}. Though there are four 2-digit binary numbers for representing every possible set-packing, not every 2-digit binary number corresponds to a legal set-packing. Hence, in the next subsection, basic biological operations are used to develop an algorithm for removing the illegal solutions and finding the legal solutions.
2-digit binary number    The corresponding set-packing
00                       ∅
01                       {{2}}
10                       {{1}}
11                       {{1}, {2}}

Table 1: The solution space for Figure 1.
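The correspondence in Table 1 is plain binary enumeration: each n-digit binary number selects a sub-collection of C. A short Python sketch (the function name is ours) that generates this solution space for an arbitrary collection:

```python
def solution_space(C):
    """Enumerate all 2^n choices: each n-bit number selects a sub-collection of C."""
    n = len(C)
    space = {}
    for z in range(2 ** n):
        bits = format(z, f"0{n}b")  # the n-digit binary number z1...zn
        # bit k set to 1 means the kth subset is chosen
        chosen = [C[k] for k in range(n) if bits[k] == "1"]
        space[bits] = chosen
    return space

# The instance of Figure 1: S = {1, 2}, C = {{1}, {2}}
space = solution_space([{1}, {2}])
```

On the Figure 1 instance this reproduces the four rows of Table 1, including the illegal-versus-legal distinction left to the biological filtering steps.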
To represent all possible set-packings for the set-packing problem, stickers [14, 22] are used to construct the solution space for the problem solved. For every bit zk representing the kth subset in C, two distinct 15-base value sequences were designed: one represents the value "0" of zk and the other represents the value "1" of zk. For the sake of convenience in our presentation, assume that zk^1 denotes the value of zk to be 1 and zk^0 denotes the value of zk to be 0. Similarly, for every bit ym representing the mth element in S, two distinct 15-base value sequences are also designed: one represents the value 1 for ym and the other represents the value 0 for ym. Also assume that ym^1 denotes the value of ym to be 1 and ym^0 denotes the value of ym to be 0. Each of the 2^n possible set-packings was represented by a library sequence of 15·(n + q) bases consisting of the concatenation of one value sequence for each bit. DNA molecules with library sequences are termed library strands, and a combinatorial pool containing library strands is termed a library. The probes used for separating the library strands have sequences complementary to the value sequences.
It is pointed out in [14, 22] that errors in the separation of the library strands are errors in the computation. Sequences must be designed to ensure that library strands have little secondary structure that might inhibit intended probe-library hybridization. The design must also exclude sequences that might encourage unintended probe-library hybridization. To help achieve these goals, sequences were computer-generated to satisfy the following constraints [22].
1. Library sequences contain only A's, T's, and C's.
2. All library and probe sequences have no occurrence of 5 or more consecutive identical nucleotides.
3. Every probe sequence has at least 4 mismatches with all 15-base alignments of any library sequence (except for its matching value sequence).
4. Every 15-base subsequence of a library sequence has at least 4 mismatches with all 15-base alignments of itself or any other library sequence.
5. No probe sequence has a run of more than 7 matches with any 8-base alignment of any library sequence (except for its matching value sequence).
6. No library sequence has a run of more than 7 matches with any 8-base alignment of itself or any other library sequence.
7. Every probe sequence has 4, 5, or 6 G's in its sequence.

Constraint (1) is motivated by the assumption that library strands composed only of A's, T's, and C's will have less secondary structure than those composed of A's, T's, C's, and G's [23]. Constraint (2) is motivated by two assumptions: first, long homopolymer tracts may have unusual secondary structure, and second, the melting temperatures of probe-library hybrids will be more uniform if none of the probe-library hybrids involve long homopolymer tracts. Constraints (3) and (5) are intended to ensure that probes bind only weakly where they are not intended to bind. Constraints (4) and (6) are intended to ensure that library strands have a low affinity for themselves. Constraint (7) is intended to ensure that the intended probe-library pairings have uniform melting temperatures.
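Constraints (1), (2) and (7) are purely local to one sequence and easy to check mechanically. The following Python sketch is our own illustration of such a check (the function name is ours, not from [22]); the pairwise mismatch constraints (3)-(6) would additionally require comparing alignments between sequences.

```python
def check_local_constraints(seq, is_probe=False):
    """Check constraints (1), (2) and (7) of Section 3.2 on one sequence."""
    # Constraint (1): library sequences contain only A's, T's and C's.
    if not is_probe and any(base not in "ATC" for base in seq):
        return False
    # Constraint (2): no occurrence of 5 or more consecutive identical nucleotides.
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run >= 5:
            return False
    # Constraint (7): every probe sequence has 4, 5 or 6 G's.
    if is_probe and seq.count("G") not in (4, 5, 6):
        return False
    return True
```

For example, the value sequence AATTCACAAACAATT from the next paragraph passes the library-side checks (only A/T/C, longest homopolymer run of 3), while a sequence containing AAAAA would be rejected by constraint (2).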
The Adleman program [22] was modified to generate DNA sequences satisfying the constraints above. For example, for the two subsets in C of Figure 1, the DNA sequences generated were: z1^0 = AATTCACAAACAATT, z2^0 = ACTCCTTCCCTACTC, z1^1 = TCTCCCTATTTATTT and z2^1 = TCACCAAACCTAAAA. Similarly, for representing every element of S in Figure 1, the DNA sequences generated were: y1^0 = TCTCTCTCTAATCAT, y2^0 = ACTCACATACACCAC, y1^1 = CCATCATCTACCTTA and y2^1 = CAACCTATTATCTTA. For every possible set-packing in Figure 1, the corresponding library strand was synthesized by employing a mix-and-split combinatorial synthesis technique [24]. Similarly, for any n-subset collection, all of the library strands for representing every possible set-packing could also be synthesized with the same technique.

3.3. The DNA Algorithm for Solving the Set-Packing Problem
Algorithm 1: Solving the set-packing problem.
(1) Input T0, where tube T0 encodes all possible 2^n choices of subsets in a collection C = {C1, …, Cn} and every element of each subset in the collection C comes from a finite set S = {s1, …, sq}.
(2) For k = 1 to n, where n is the number of subsets in C.
    (2a) TON = +(T0, zk^1) and TOFF = −(T0, zk^1).
    (2b) For m = 1 to |Ck|, where |Ck| is the number of elements in the kth subset in C. Assume that the mth element in Ck is sm and ym is used to represent it.
        (2c) TBAD = +(TON, ym^1) and TON = −(TON, ym^1).
        (2d) Discard(TBAD).
        (2e) Append(TON, ym^1).
    End For
    (2f) T0 = ∪(TON, TOFF).
End For
(3) For k = 0 to n − 1
    For j = k down to 0
        (3a) Tj+1^ON = +(Tj, zk+1^1) and Tj = −(Tj, zk+1^1).
        (3b) Tj+1 = ∪(Tj+1, Tj+1^ON).
    End For
End For
(4) For k = n down to 1
    (4a) If (detect(Tk) = 'yes') then
    (4b) Read(Tk) and terminate the algorithm.
    End If
End For
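Algorithm 1 can be dry-run in ordinary software by modeling each tube as a collection of strands. The following Python sketch is our own simulation, not a molecular implementation: a strand is a pair (indices of chosen subsets, set of already-appended element values), and the three phases mirror steps (2), (3) and (4).

```python
def solve_set_packing(S, C):
    """Simulate Algorithm 1; tubes hold (chosen-subset-indices, used-elements) pairs."""
    n = len(C)
    # Step (1): T0 encodes all 2^n choices of subsets (bit k of z <-> subset k).
    T0 = [(tuple(k for k in range(n) if (z >> k) & 1), frozenset())
          for z in range(2 ** n)]
    # Step (2): remove choices whose chosen subsets share an element.
    for k in range(n):
        T_on = [t for t in T0 if k in t[0]]           # (2a) extract on zk = 1
        T_off = [t for t in T0 if k not in t[0]]
        for s in C[k]:                                # (2b)-(2e) per element of Ck
            T_on = [t for t in T_on if s not in t[1]]      # (2c)-(2d) discard TBAD
            T_on = [(c, u | {s}) for (c, u) in T_on]       # (2e) append ym^1
        T0 = T_on + T_off                             # (2f) merge
    # Step (3): distribute legal strands into tubes T1..Tn by number of subsets.
    tubes = {j: [] for j in range(n + 1)}
    for t in T0:
        tubes[len(t[0])].append(t)
    # Step (4): detect the largest non-empty tube and read one strand from it.
    for k in range(n, 0, -1):
        if tubes[k]:
            return tubes[k][0][0]   # indices of a maximum-size set-packing
    return ()
```

On the Figure 1 instance, `solve_set_packing({1, 2}, [{1}, {2}])` returns the index pair for {C1, C2}, matching the worked example below.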
Theorem 3-1: From those steps in Algorithm 1, the set-packing problem for a q-element set S and an n-subset collection C can be resolved. Proof:
A test tube of DNA strands that represent all possible 2^n input bit sequences z1, …, zn is yielded in Step 1. It is obvious that the test tube contains all possible 2^n choices of set-packing.

According to the definition of set-packing [10], Step 2 is executed at most n·q times to check which subsets are disjoint. The first execution of Step 2(a) uses the "extraction" operation to form two test tubes: TON and TOFF. The first tube TON includes all of the strands that have zk = 1; that is to say, the first subset in C occurs in tube TON. The second tube TOFF consists of all of the strands that have zk = 0; this is to say that the first subset in C does not appear in tube TOFF. Step 2(b) is an inner loop used to examine whether the elements in the kth subset are different from other elements in other subsets. The first execution of Step 2(c) uses the "extraction" operation to form two test tubes: TBAD and TON. The first tube TBAD includes all of the strands that have ym = 1; that is to say, the first element of the first subset in C has already appeared in tube TBAD. The second tube TON contains all of the strands that have ym = 0; this means that the first element of the first subset in C does not appear in tube TON. From the definition of set-packing [10], tube TBAD includes illegal strands. Hence, Step 2(d) applies the "discard" operation to discard tube TBAD. Step 2(e) employs the "append" operation to append the 15-base DNA sequence representing the value 1 of ym onto the tail of every strand in TON. Steps 2(c) to 2(e) are repeated until every element in the first subset in C is processed. Then, Step 2(f) uses the "merge" operation to pour the two tubes TON and TOFF into tube T0. That is to say, tube T0 now contains the strands representing every element in the first subset of C. Similarly, after the other subsets are processed, tube T0 consists of the strands representing the legal set-packings.
Each time the outer loop in Step 3 is executed, the inner loop is executed (k + 1) times. On the first execution of the outer loop, the inner loop is executed only once, so Steps 3(a) and 3(b) are each executed once. Step 3(a) uses the "extraction" operation to form two test tubes: T1^ON and T0. The first tube T1^ON contains all of the strands that have z1 = 1, and the second tube T0 consists of all of the strands that have z1 = 0. That is to say, the first tube encodes every set-packing containing the first subset in C and the second tube represents every set-packing without the first subset in C. Hence, Step 3(b) applies the "merge" operation to pour tube T1^ON into tube T1. After repeated execution of Steps 3(a) and 3(b), n new tubes are finally produced. Tube Tk for n ≥ k ≥ 1 then encodes the DNA strands that contain k subsets.
Because the set-packing problem is to find a maximum-size set-packing, tube Tn is detected first using the "detection" operation in Step 4(a). If it returns "yes", then tube Tn contains at least one maximum-size set-packing. Therefore, Step 4(b) uses the "read" operation to describe the sequence of a molecule in tube Tn and terminates the algorithm. Otherwise, Step 4(a) is repeated until a maximum-size set-packing is detected in some tube.
The set S and the collection C in Figure 1 can be applied to show the power of Algorithm 1. In Figure 1, the finite set S is {1, 2} and the collection C is {{1}, {2}}. Assume that the collection C is {C1, C2}, where C1 and C2 are {1} and {2}, respectively. From Step 1 in Algorithm 1, tube T0 is filled with four library strands using the techniques mentioned in subsection 3.2, representing the four possible set-packings for S and C. On the first execution of Step 2(a) in Algorithm 1, two tubes are yielded. The first tube, TON, includes the numbers 1* (* can be either 1 or 0), and the second tube, TOFF, contains the numbers 0*. That is to say, the first tube, TON, includes {C1} and {C1, C2}, and the second tube, TOFF, contains ∅ and {C2}. Then, the first execution of Step 2(c) uses the "extraction" operation to form two test tubes: TBAD and TON; this is to say that the first element, 1, in C1 has appeared in TBAD and does not appear in TON. Step 2(d) applies the "discard" operation to discard tube TBAD. After Step 2(e) is executed, tube TON contains the bit strings 10 1(y1^1) and 11 1(y1^1). Next, after Step 2(f) is executed, tube T0 includes the bit strings 00, 01, 10 1(y1^1) and 11 1(y1^1). Similarly, after the second subset C2 is processed, tube T0 consists of the bit strings 00, 10 1(y1^1), 01 1(y2^1) and 11 1(y1^1) 1(y2^1).

The first execution of Step 3(a) in Algorithm 1 applies the "extraction" operation to produce two tubes, T1^ON and T0. Tube T1^ON contains the bit strings 10 1(y1^1) and 11 1(y1^1) 1(y2^1), while tube T0 includes the bit strings 00 and 01 1(y2^1). Step 3(b) applies the "merge" operation to pour tube T1^ON into tube T1. Tube T1 then includes the bit strings 10 1(y1^1) and 11 1(y1^1) 1(y2^1), which represent the two set-packings {C1} and {C1, C2}. After repeated execution of Steps 3(a) and 3(b), two new tubes are produced. The new tube Tk for 2 ≥ k ≥ 1 encodes the legal set-packings that contain k subsets: tube T2 contains the bit string 11 1(y1^1) 1(y2^1), and tube T1 includes the bit strings 10 1(y1^1) and 01 1(y2^1).

The set-packing problem is to find a maximum-size set-packing. On the first execution of Step 4(a), tube T2 is detected first using the "detection" operation. This
step returns "yes" for the "detection" operation on tube T2. Therefore, Step 4(b) applies the "read" operation to describe the sequence of a molecule in tube T2 and terminates the algorithm. The maximum-size set-packing is found to be {C1, C2}.
3.4. The Complexity of Algorithm 1
Theorem 3-2: Suppose that a finite set S is {s1, …, sq} and a collection C is {C1, …, Cn}, where Ck is a subset of elements from S for 1 ≤ k ≤ n. The set-packing problem for S and C can be solved with O(q·n + n²) biological operations in the Adleman-Lipton model.
Proof:
Algorithm 1 can be applied to solve the set-packing problem for S and C. Algorithm 1 includes three main steps. Step 2 is mainly used to remove illegal library strands from all of the 2^n possible library strands. From Algorithm 1, it is obvious that Steps 2(a) through 2(f) use (n + q·n) "extraction" operations, (q·n) "discard" operations, (q·n) "append" operations and n "merge" operations. Step 3 is mainly used to figure out the number of subsets in every legal set-packing. It is indicated from Algorithm 1 that Step 3(a) takes (n·(n+1)/2) "extraction" operations and Step 3(b) takes (n·(n+1)/2) "merge" operations. Step 4 is employed to find a maximum-size set-packing among the legal set-packings. It is pointed out from Algorithm 1 that Step 4(a) takes at most n "detection" operations and Step 4(b) takes one "read" operation. Hence, from the statements above, it is inferred that the time complexity of Algorithm 1 is O(q·n + n²) biological operations in the Adleman-Lipton model.

Theorem 3-3: Suppose that a finite set S is {s1, …, sq} and a collection C is {C1, …, Cn}, where Ck is a subset of elements from S for 1 ≤ k ≤ n. The set-packing problem for S and C can be solved with O(2^n) library strands in the Adleman-Lipton model.
Proof: Refer to Theorem 3-2.

Theorem 3-4: Suppose that a finite set S is {s1, …, sq} and a collection C is {C1, …, Cn}, where Ck is a subset of elements from S for 1 ≤ k ≤ n. The set-packing problem for S and C can be solved with O(n) tubes in the Adleman-Lipton model.
Proof: Refer to Theorem 3-2.
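The operation counts in the proof of Theorem 3-2 can be tallied symbolically. The following small Python sketch is our own bookkeeping of those totals for given n and q, taking every subset to have q elements (the worst case the bounds allow); the function name is ours.

```python
def operation_counts(n, q):
    """Worst-case biological operation counts for Algorithm 1 (each |Ck| = q)."""
    return {
        "extract": (n + q * n) + n * (n + 1) // 2,  # Steps 2(a), 2(c) and 3(a)
        "discard": q * n,                           # Step 2(d)
        "append": q * n,                            # Step 2(e)
        "merge": n + n * (n + 1) // 2,              # Steps 2(f) and 3(b)
        "detect": n,                                # Step 4(a), at most
        "read": 1,                                  # Step 4(b)
    }

counts = operation_counts(n=3, q=2)
```

Summing the entries grows as q·n plus n², consistent with the O(q·n + n²) bound of Theorem 3-2.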
Theorem 3-5: Suppose that a finite set S is {s1, …, sq} and a collection C is {C1, …, Cn}, where Ck is a subset of elements from S for 1 ≤ k ≤ n. The set-packing problem for S and C can be solved with a longest library strand of O(15·n + 15·q) bases in the Adleman-Lipton model.
Proof: Refer to Theorem 3-2.

3.5. Range of Application of Cook's Theorem on DNA-Based Computing
Cook's Theorem [28] states that if an algorithm for one NP-complete problem is developed, then other NP-complete problems can be solved by means of reduction to that problem. Cook's Theorem has been demonstrated to be correct on general digital electronic computers. Let us assume a graph G = (V, E), where V is the set of vertices and E is the set of edges, and let |V| and |E| be the number of vertices and the number of edges. Mathematically, a clique for a graph G = (V, E) is a subset V1 ⊆ V of vertices such that every two vertices in V1 are joined by an edge in E [9, 10, 29]. The clique problem is to find a maximum-size clique in G. The simple structure of the clique problem makes it one of the most widely used problems for deriving other NP-completeness results [9, 28, 29]. The following theorem describes the range of application of Cook's Theorem on DNA-based computing.

Theorem 3-6: Assume that any other NP-complete problem can be reduced to the set-packing problem with a polynomial time algorithm on a general electronic computer. If the size of a reduced NP-complete problem is less than or equal to that of the set-packing problem, then Cook's Theorem is correct on DNA-based computing.
Proof: We transform the clique problem to the set-packing problem with a polynomial time algorithm [29]. Assume a graph G = (V, E), where V is {v1, v2, …, vn} and E is {(vp, vq) | (vp, vq) is an edge in G}, and assume that |E| is at most (n·(n − 1))/2. Suppose that a graph G is any instance of the clique problem. We construct a finite set S and a collection C, where S = {s1, …, s|E|} and C = {C1, …, C|V|}, such that S and C have a maximum-size set-packing if and only if G has a maximum-size clique.
From [29], each vertex vk in V corresponds to a subset Ck in C for 1 ≤ k ≤ n. Each edge in E also corresponds to an element sm in S for 1 ≤ m ≤ |E|. Each subset Ck in C for 1 ≤ k ≤ n is {sm | sm represents an edge containing the vertex vk, 1 ≤ m ≤ |E|}.
Setting S and C from G completes the construction of our instance of the set-packing problem. Therefore, the numbers of elements in S and C are |E| and n, respectively. Algorithm 1 is used to solve the set-packing problem for S and C with 2ⁿ DNA strands, O(|E|·n + n²) biological operations, O(n) tubes and the longest library strand, O(15·n + 15·|E|). That is to say, Algorithm 1 can be applied to solve the clique problem by means of reducing it to the set-packing problem. Hence, it is derived that if the size of a reduced NP-complete problem is less than or equal to that of the set-packing problem, then Cook's Theorem is correct on DNA-based computing.
From Theorem 3-6, if the size of a reduced NP-complete problem is equal to or less than that of the set-packing problem, then Algorithm 1 can be directly used for solving the reduced NP-complete problem. Otherwise, a new DNA algorithm should be developed from the characteristics of the NP-complete problem.
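As a concrete illustration of the reduction just described, the following sketch builds S and C from a graph exactly as the text states (one element per edge of G, with Ck collecting the edges incident to vk); the function name and the integer edge encoding are our own.

```python
# Build the set-packing instance (S, C) from a graph G, following the
# construction described in the text: S has one element per edge of G,
# and subset C_k collects the edges that contain vertex v_k.
def reduce_clique_to_set_packing(n_vertices, edges):
    S = list(range(len(edges)))  # element m stands for the m-th edge
    C = [{m for m, (p, q) in enumerate(edges) if k in (p, q)}
         for k in range(n_vertices)]
    return S, C

# Example: a triangle on vertices 0, 1, 2.
S, C = reduce_clique_to_set_packing(3, [(0, 1), (1, 2), (0, 2)])
```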
4. Using Sticker for Solving the Clique Problem in the Adleman-Lipton Model
4.1. The DNA Algorithm for Solving the Clique Problem
Suppose that a graph G = (V, E), where V is {v1, v2, …, vn} and E is {(vp, vq) | (vp, vq) is an edge in G}. Also assume that |V| is n and |E| is at most (n·(n − 1))/2. Consider the complementary graph G1 = (V, E1) for G, where E1 is {(vc, vd) | (vc, vd) is not in E}, and note that |E1| is (n·(n − 1))/2 − |E|. A clique for a graph G = (V, E) is a complete sub-graph of G [29]. The clique problem is to find a maximum-size clique in G. Algorithm 2 is presented for solving the clique problem; the notations in Algorithm 2 are similar to those described in Section 3.2.

Algorithm 2: Solving the clique problem.
(1) Input(T0), where tube T0 encodes all possible 2ⁿ choices of clique for a graph G = (V, E), where |V| is n and |E| is at most (n·(n − 1))/2.
(2) For m = 1 to (n·(n − 1))/2 − |E|
      Assume that em is an edge in G1 and em = (vc, vd). Also suppose that zc and zd, respectively, represent vc and vd.
      (a) TON = +(T0, zc¹) and TOFF = −(T0, zc¹).
      (b) TON1 = +(TON, zd¹) and TOFF1 = −(TON, zd¹).
      (c) Discard(TON1).
      (d) T0 = ∪(TOFF, TOFF1).
    EndFor
(3) For k = 0 to n − 1
      For j = k down to 0
        (3a) Tj+1ON = +(Tj, zk+1¹) and Tj = −(Tj, zk+1¹).
        (3b) Tj+1 = ∪(Tj+1, Tj+1ON).
      EndFor
    EndFor
(4) For k = n down to 1
      (a) If (detect(Tk) = 'yes') then
      (b)   Read(Tk) and terminate the algorithm.
          EndIf
    EndFor

Theorem 4-1: From the steps in Algorithm 2, the clique problem for a graph G = (V, E) can be solved, where |V| is n and |E| is at most (n·(n − 1))/2.
Proof:
A test tube of DNA strands that represent all possible 2ⁿ input bit sequences z1, …, zn is yielded in Step 1. It is obvious that the test tube contains all possible 2ⁿ choices of clique.

From the definition of clique [10], Step 2 is executed ((n·(n − 1))/2 − |E|) times to check which strands are legal. The first execution of Step 2(a) uses the "extraction" operation to form two test tubes: TON and TOFF. The first tube TON includes all of the strands that have zc = 1; that is to say, the cth vertex in V occurs in tube TON. The second tube TOFF consists of all of the strands that have zc = 0; that is to say, the cth vertex in V does not appear in tube TOFF. Then, the first execution of Step 2(b) uses the "extraction" operation to form two test tubes: TON1 and TOFF1. The first tube TON1 includes all of the strands that have zc = 1 and zd = 1; that is to say, tube TON1 includes vertices that are not connected in G. The second tube TOFF1 contains all of the strands that have zc = 1 and zd = 0; this is to say that the cth vertex in V occurs in TOFF1 and the dth vertex in V does not appear in TOFF1. From the definition of clique [10], tube TON1 includes illegal strands. Hence, Step 2(c) applies the "discard" operation to discard tube TON1. Next, Step 2(d) uses the "merge" operation to pour the two tubes TOFF and TOFF1 into tube T0, so that tube T0 contains only legal solutions. After the remaining iterations are processed, tube T0 consists of those strands representing legal cliques.
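Step 2 can be mimicked in software with bit tuples standing in for strands. This is a sketch of the filtering logic only; list operations replace the tube operations, and no chemistry is modelled.

```python
from itertools import product

# Simulate Step 2 of Algorithm 2: start from all 2^n bit strings
# (z_k = 1 means vertex k is in the candidate clique) and, for every
# edge (c, d) of the complementary graph G1, discard the strands with
# z_c = 1 and z_d = 1, since non-adjacent vertices cannot share a clique.
def filter_illegal_strands(n, complement_edges):
    tube = list(product((0, 1), repeat=n))
    for (c, d) in complement_edges:
        # Keep TOFF (z_c = 0) and TOFF1 (z_c = 1, z_d = 0); drop TON1.
        tube = [b for b in tube if not (b[c] == 1 and b[d] == 1)]
    return tube

# Example: 3 vertices where only the edge (0, 2) is missing from G.
legal = filter_illegal_strands(3, [(0, 2)])
```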
Each time the outer loop in Step 3 is executed, the inner loop is executed (k + 1) times. On the first execution of the outer loop, the inner loop is executed only once, so Steps 3(a) and 3(b) are executed once. Step 3(a) uses the "extraction" operation to form two test tubes: T1ON and T0. The first tube T1ON contains all of the strands that have z1 = 1. The second tube T0 consists of all of the strands that have z1 = 0. That is to say, the first tube encodes every clique with the first vertex in G and the second tube represents every clique without the first vertex in G. Then, Step 3(b) applies the "merge" operation to pour tube T1ON into tube T1. After repeated execution of Steps 3(a) and 3(b), n new tubes are finally produced. The tube Tk for n ≥ k ≥ 1 encodes those DNA strands that contain k vertices.
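The tube-sorting of Step 3 and the top-down scan of Step 4 can likewise be sketched with ordinary lists. Again this is a software analogy of ours, reproducing the net effect stated above (Tk holds the strands with k vertices), not the laboratory procedure.

```python
# Simulate Steps 3 and 4 of Algorithm 2: sort the legal strands into
# tubes T_0 ... T_n by the number of 1-bits (vertices) they contain,
# then detect from T_n downward and read the first non-empty tube.
def sort_and_read(legal_strands, n):
    tubes = {k: [] for k in range(n + 1)}
    for strand in legal_strands:
        tubes[sum(strand)].append(strand)  # T_k holds strands with k ones
    for k in range(n, 0, -1):              # Step 4: "detect" from T_n down
        if tubes[k]:
            return k, tubes[k][0]          # "read" one answer strand
    return 0, None

# Legal strands of the 3-vertex example with complement edge (0, 2):
strands = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
size, answer = sort_and_read(strands, 3)
```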
Because the clique problem is to find a maximum-size clique, tube Tn is detected first with the "detection" operation in Step 4(a). If it returns a "yes", then tube Tn contains at least one maximum-size clique. Therefore, Step 4(b) uses the "read" operation to describe the sequence of a molecule in tube Tn and terminate the algorithm. Otherwise, execution of Step 4(a) is repeated until a maximum-size clique is found in a detected tube.

4.2. The Complexity of Algorithm 2
Theorem 4-2: Suppose that a graph G = (V, E), where |V| is n and |E| is at most (n·(n − 1))/2. The clique problem for G can be solved with O(n² − |E|) biological operations in the Adleman-Lipton model.
Proof:
Algorithm 2 can be applied to solve the clique problem for G. Algorithm 2 includes three main steps. Step 2 is mainly used to remove illegal library strands from all of the 2ⁿ possible library strands. From Algorithm 2, Steps 2(a) through 2(d) take (2·((n·(n − 1))/2 − |E|)) "extraction" operations, ((n·(n − 1))/2 − |E|) "discard" operations and ((n·(n − 1))/2 − |E|) "merge" operations. Step 3 is mainly used to figure out the number of vertices in every legal clique. It is indicated from Algorithm 2 that Step 3(a) takes (n·(n+1)/2) "extraction" operations and Step 3(b) takes (n·(n+1)/2) "merge" operations. Step 4 is employed to find a maximum-size clique from the legal cliques. It is pointed out from Algorithm 2 that Step 4(a) takes at most n "detection" operations and Step 4(b) takes one "read" operation. Hence, from the statements mentioned above, it is inferred that the time complexity of
the worst case for Algorithm 2 is O(n² − |E|) biological operations in the Adleman-Lipton model.

Theorem 4-3: Suppose that a graph G = (V, E), where |V| is n and |E| is at most (n·(n − 1))/2. The clique problem for G can be solved with O(2ⁿ) library strands in the Adleman-Lipton model.
Proof: Refer to Theorem 4-2.

Theorem 4-4: Suppose that a graph G = (V, E), where |V| is n and |E| is at most (n·(n − 1))/2. The clique problem for G can be solved with O(n) tubes in the Adleman-Lipton model.
Proof: Refer to Theorem 4-2.

Theorem 4-5: Suppose that a graph G = (V, E), where |V| is n and |E| is at most (n·(n − 1))/2. The clique problem for G can be solved with the longest library strand, O(15·n), in the Adleman-Lipton model.
Proof: Refer to Theorem 4-2.

4.3. The Comparison of Algorithm 2 with Algorithm 1 for Dealing with the Clique Problem
From Theorem 3-6, the clique problem for a graph G = (V, E) can be directly reduced to the set-packing problem for a finite set S and a collection C. Algorithm 1 can then be applied to the clique problem for G with 2ⁿ DNA strands, O(|E|·n + n²) biological operations, O(n) tubes and the longest library strand, O(15·n + 15·|E|). From the characteristics of the clique problem, Algorithm 2 is proposed to deal with the same problem with 2ⁿ library strands, O(n² − |E|) biological operations, O(n) tubes and the longest library strand, O(15·n). From the statements above, for solving the clique problem for a graph G = (V, E), Algorithm 2 is better than Algorithm 1 in terms of biological operations and the longest library strand. This implies that each NP-complete problem should correspond to a new DNA algorithm.

5. Experimental Results of Simulated DNA Computing
We modified the Adleman program [22] on a Pentium 4 with 128 MB of main memory; the operating system was Windows 98 and the compiler was Visual C++ 6.0. The modified program was applied to generating DNA sequences for solving the set-packing problem and the clique problem. Because the source code of the two functions srand48() and drand48() was not found in the original Adleman program, we used the standard function srand() in Visual C++ 6.0 to replace srand48() and added source code for the function drand48(). We also added subroutines to the Adleman program for simulating the biological operations of the Adleman-Lipton model in Section 2, as well as subroutines to simulate Algorithm 1 and Algorithm 2. For convenience of presentation, the simulation of Algorithm 1 is described below in detail; the simulation of Algorithm 2 is similar.
The Adleman program was used to construct a 15-base DNA sequence for every bit of the library. For each bit, the program generates two 15-base random sequences (for '1' and '0') and checks whether the library strands satisfy the seven constraints in subsection 3.2 with the new DNA sequences added. If the constraints are satisfied, the new DNA sequences are 'greedily' accepted. If the constraints are not satisfied, then mutations are introduced one by one into the new block until either: (A) the constraints are satisfied and the new DNA sequences are accepted, or (B) a threshold for the number of mutations is exceeded, in which case the program has failed and exits, printing the sequences found so far. If (n + q) bits that satisfy the constraints are found, then the program has succeeded and outputs these sequences.
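The generate-then-mutate loop described above can be sketched as follows. For brevity only one stand-in constraint is checked (the seventh constraint's requirement of 4, 5, or 6 G's per 15-mer), not all seven, and the function names and mutation threshold are ours.

```python
import random

BASES = "ACGT"

def g_count_ok(seq):
    # Stand-in for the seventh constraint of subsection 3.2:
    # a probe sequence must contain 4, 5, or 6 G's.
    return 4 <= seq.count("G") <= 6

def generate_sequence(length=15, max_mutations=100, seed=0):
    # Generate a random sequence, then introduce point mutations one by
    # one until the constraint holds or the mutation budget is exceeded.
    rng = random.Random(seed)
    seq = [rng.choice(BASES) for _ in range(length)]
    for _ in range(max_mutations):
        if g_count_ok("".join(seq)):
            return "".join(seq)
        seq[rng.randrange(length)] = rng.choice(BASES)
    return None  # threshold exceeded: the program reports failure

probe = generate_sequence()
```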
Consider the finite set S and the collection C in Figure 1. The collection C includes two subsets: {1} and {2}. The finite set S is {1, 2}. DNA sequences generated by the modified Adleman program are shown in Table 2. The program took one mutation to make the new DNA sequences for the two subsets in C and the two elements of S. With the nearest-neighbor parameters, the Adleman program was used to calculate the enthalpy, entropy, and free energy for the binding of each probe to its corresponding region on a library strand. The energies used are shown in Table 3. Only ΔG really matters to the energy of each bit. For example, the ΔG for the probe binding a '1' in the first bit is 25 kcal/mol and the ΔG for the probe binding a '0' is estimated to be 24.3 kcal/mol.
Bit    5'→3' DNA Sequence
z1⁰    AATTCACAAACAATT
z2⁰    ACTCCTTCCCTACTC
z1¹    TCTCCCTATTTATTT
z2¹    TCACCAAACCTAAAA
y1⁰    TCTCTCTCTAATCAT
y2⁰    ACTCACATACACCAC
y1¹    CCATCATCTACCTTA
y2¹    CAACCTATTATCTTA

Table 2: Sequences chosen to represent the two subsets in C and every element of S in Figure 1.
Bit    Enthalpy energy (H)    Entropy energy (S)    Free energy (G)
z1¹    114.4                  299.4                 25
z1⁰    107.8                  278.6                 24.3
z2¹    111.5                  284.8                 26.2
z2⁰    109.1                  279                   25.9
y1¹    105.2                  270.5                 24.4
y1⁰    97.3                   252.3                 22.1
y2¹    107                    282.4                 22.5
y2⁰    94.7                   239.8                 22.8

Table 3: The energy for binding of each probe to its corresponding region on a library strand.
The program simulated a mix-and-split combinatorial synthesis technique [24] to synthesize the library strand for every possible set-packing. These library strands are shown in Table 4 and represent every possible set-packing: ∅, {2}, {1} and {1, 2}. The program also figured out the average and standard deviation of the enthalpy, entropy and free energy over all probe/library strand interactions. The energy levels are shown in Table 5. The standard deviation for ΔG is small because this is partially enforced by the constraint that there are 4, 5, or 6 G's in the probe sequences (the seventh constraint in subsection 3.2).
∅        5'-AATTCACAAACAATTACTCCTTCCCTACTC-3'
         3'-TTAAGTGTTTGTTAATGAGGAAGGGATGAG-5'

{2}      5'-AATTCACAAACAATTTCACCAAACCTAAAA-3'
         3'-TTAAGTGTTTGTTAAAGTGGTTTGGATTTT-5'

{1}      5'-TCTCCCTATTTATTTACTCCTTCCCTACTC-3'
         3'-AGAGGGATAAATAAATGAGGAAGGGATGAG-5'

{1, 2}   5'-TCTCCCTATTTATTTTCACCAAACCTAAAA-3'
         3'-AGAGGGATAAATAAAAGTGGTTTGGATTTT-5'

Table 4: DNA sequences chosen to represent every possible set-packing in tube T0.
                     Enthalpy energy (H)    Entropy energy (S)    Free energy (G)
Average              105.875                273.35                24.15
Standard deviation   6.31031                17.7764               1.45002

Table 5: The energy over all probe/library strand interactions.
The Adleman program was employed to compute the distribution of the different types of potential mishybridizations. This distribution is the absolute frequency of a probe-strand match of length k, for k from 0 to the bit length 15, where the probes are not supposed to match the strands. The distribution was 120, 209, 265, 378, 418, 396, 279, 147, 78, 32, 16, 0, 0, 0, 0 and 0. The last five zeros indicate that there are 0 occurrences where a probe matches a strand at 11, 12, 13, 14 or 15 places, which shows that the third constraint in subsection 3.2 has been satisfied. The number of matches peaks at 4 (418); that is to say, there are 418 occurrences where a probe matches a strand at 4 places.

Bit string of solution   5'→3' DNA Sequence
00                       5'-AATTCACAAACAATTACTCCTTCCCTACTC-3'
101(y1¹)                 5'-TCTCCCTATTTATTTACTCCTTCCCTACTCCCATCATCTACCTTA-3'
011(y2¹)                 5'-AATTCACAAACAATTTCACCAAACCTAAAACAACCTATTATCTTA-3'
111(y1¹)1(y2¹)           5'-TCTCCCTATTTATTTTCACCAAACCTAAAACCATCATCTACCTTACAACCTATTATCTTA-3'

Table 6: DNA sequences generated by Step 2 were applied to represent legal solutions in tube T0.
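The mishybridization tally described above can be reproduced with a simple counting loop. We assume here that "a probe-strand match of length k" means the probe and an off-target region agree at k base positions; this reading, and the helper name, are ours.

```python
# Histogram the number of agreeing base positions between each probe and
# each region it is NOT supposed to bind, for match lengths 0..length.
def mishybridization_histogram(probes, regions, length=15):
    hist = [0] * (length + 1)
    for probe in probes:
        for region in regions:
            matches = sum(1 for a, b in zip(probe, region) if a == b)
            hist[matches] += 1
    return hist

# Tiny example with 5-base sequences instead of 15-base ones:
hist = mishybridization_histogram(["ACGTA"], ["ACGTT", "TTTTT"], length=5)
```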
Tube   Bit string of solution   5'→3' DNA Sequence
T0     00                       5'-AATTCACAAACAATTACTCCTTCCCTACTC-3'
T1     101(y1¹)                 5'-TCTCCCTATTTATTTACTCCTTCCCTACTCCCATCATCTACCTTA-3'
T1     011(y2¹)                 5'-AATTCACAAACAATTTCACCAAACCTAAAACAACCTATTATCTTA-3'
T2     111(y1¹)1(y2¹)           5'-TCTCCCTATTTATTTTCACCAAACCTAAAACCATCATCTACCTTACAACCTATTATCTTA-3'

Table 7: DNA sequences generated by Step 3 were employed to represent legal solutions in tubes T0, T1 and T2.
Bit string of solution   5'→3' DNA Sequence
111(y1¹)1(y2¹)           5'-TCTCCCTATTTATTTTCACCAAACCTAAAACCATCATCTACCTTACAACCTATTATCTTA-3'

Table 8: The answer was found from Step 4 in tube T2.
The results for simulation of Steps 2 through 4 are shown in Tables 6, 7 and 8.
From tube T2, the answer was found to be {1, 2}.
6. Conclusions and Future Work
The proposed algorithms (Algorithms 1 and 2) for solving the set-packing and clique problems are based on the biological operations of the Adleman-Lipton model and the solution space of stickers in the sticker-based model. The present algorithms have several advantages from the Adleman-Lipton model and the sticker-based model. First, the proposed algorithms have a lower rate of hybridization errors, because we modified the Adleman program to generate good DNA sequences for constructing the solution space of stickers for the set-packing and clique problems, and only simple and fast biological operations in the Adleman-Lipton model were employed to solve the problems. Second, the biological operations in the Adleman-Lipton model have been performed in a fully automated manner in the laboratory. Full automation is essential not only for the speedup of computation but also for error-free computation. Third, in Algorithm 1 for solving the set-packing problem, the number of tubes, the longest length of DNA library strands and the number of DNA library strands are, respectively, O(n), O(15·n + 15·q) and O(2ⁿ). In Algorithm 2 for solving the clique problem, they are, respectively, O(n), O(15·n) and O(2ⁿ). This implies that the present algorithms can be easily performed in a fully automated manner in a laboratory. Furthermore, the present algorithms generate 2ⁿ library strands satisfying the seven constraints in subsection 3.2, corresponding to the 2ⁿ possible solutions; this allows the present algorithms to be applied to larger instances of the set-packing and clique problems. Fourth, from the statements in subsection 4.3, for solving the clique problem, Algorithm 2 is better than Algorithm 1 in terms of biological operations and the longest library strand.
Therefore, this seems to imply that Cook's Theorem is unsuitable for DNA-based computers and that each NP-complete problem should correspond to a new DNA algorithm.

Currently, many NP-complete problems cannot be solved because it
is very difficult to support basic biological operations using mathematical operations. We are not sure whether molecular computing can be applied to dealing with every NP-complete problem. Therefore, in the future, our main work is to solve other NP-complete problems that were unresolved with the Adleman-Lipton model and the sticker model.
References
[1] R. R. Sinden. DNA Structure and Function. Academic Press, 1994.
[2] L. Adleman. Molecular computation of solutions to combinatorial problems. Science, 266:1021-1024, Nov. 11, 1994.
[3] R. J. Lipton. DNA solution of hard computational problems. Science, 268:542-545, 1995.
[4] Q. Ouyang, P. D. Kaplan, S. Liu, and A. Libchaber. DNA solution of the maximal clique problem. Science, 278:446-449, 1997.
[5] M. Arita, A. Suyama, and M. Hagiya. A heuristic approach for Hamiltonian path problem with molecules. Proceedings of 2nd Genetic Programming (GP-97), 1997, pp. 457-462.
[6] N. Morimoto, M. Arita, and A. Suyama. Solid phase DNA solution to the Hamiltonian path problem. DIMACS (Series in Discrete Mathematics and Theoretical Computer Science), Vol. 48, 1999, pp. 93-206.
[7] A. Narayanan, and S. Zorbala. DNA algorithms for computing shortest paths. In Genetic Programming 1998: Proceedings of the Third Annual Conference, J. R. Koza et al. (Eds), 1998, pp. 718--724.
[8] S.-Y. Shin, B.-T. Zhang, and S.-S. Jun. Solving traveling salesman problems using molecular programming. Proceedings of the 1999 Congress on Evolutionary Computation (CEC99), vol. 2, pp. 994-1000, 1999.
[9] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press.
[10] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, CA, 1979.
[11] D. Boneh, C. Dunworth, R. J. Lipton and J. Sgall. On the computational Power of DNA. In Discrete Applied Mathematics, Special Issue on Computational Molecular Biology, Vol. 71 (1996), pp. 79-94.
[12] L. M. Adleman. On constructing a molecular computer. DNA Based Computers, Eds. R. Lipton and E. Baum, DIMACS: series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society. 1-21 (1996)
[13] M. Amos. "DNA Computation". Ph.D. Thesis, Department of Computer Science, University of Warwick, 1997.
[14] S. Roweis, E. Winfree, R. Burgoyne, N. V. Chelyapov, M. F. Goodman, Paul W.K. Rothemund and L. M. Adleman. “A Sticker Based Model for DNA
Computation”. 2nd annual workshop on DNA Computing, Princeton University. Eds. L. Landweber and E. Baum, DIMACS: series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society. 1-29 (1999).
[15] M. J. Perez-Jimenez and F. Sancho-Caparrini. "Solving Knapsack Problems in a Sticker Based Model". 7th Annual Workshop on DNA Computing, DIMACS: Series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society, 2001.
[16] G. Paun, G. Rozenberg and A. Salomaa. DNA Computing: New Computing Paradigms. Springer-Verlag, New York, 1998. ISBN: 3-540-64196-3.
[17] W.-L. Chang and M. Guo. "Solving the Dominating-set Problem in Adleman-Lipton’s Model". The Third International Conference on Parallel and Distributed Computing, Applications and Technologies, Japan, 2002, pp. 167-172.
[18] W.-L. Chang and M. Guo. "Solving the Clique Problem and the Vertex Cover Problem in Adleman-Lipton's Model". IASTED International Conference, Networks, Parallel and Distributed Processing, and Applications, Japan, 2002, pp. 431-436.
[19] W.-L. Chang and M. Guo. "Solving NP-Complete Problem in the Adleman-Lipton Model". The Proceedings of the 2002 International Conference on Computer and Information Technology, Japan, 2002, pp. 157-162.
[20] W.-L. Chang and M. Guo. "Resolving the 3-Dimensional Matching Problem and the Set-Packing Problem in Adleman-Lipton's Model". IASTED International Conference, Networks, Parallel and Distributed Processing, and Applications, Japan, 2002, pp. 455-460.
[21] Bin Fu. "Volume Bounded Molecular Computation". Ph.D. Thesis, Department of Computer Science, Yale University, 1997.
[22] Ravinderjit S. Braich, Clifford Johnson, Paul W.K. Rothemund, Darryl Hwang, Nickolas Chelyapov and Leonard M. Adleman. "Solution of a satisfiability problem on a gel-based DNA computer". Proceedings of the 6th International Conference on DNA Computation in the Springer-Verlag Lecture Notes in Computer Science series.
[23] Kalim Mir. “A restricted genetic alphabet for DNA computing”. Eric B. Baum and Laura F. Landweber, editors. DNA Based Computers II: DIMACS Workshop, June 10-12, 1996, volume 44 of DIMACS: Series in Discrete Mathematics and Theoretical Computer Science, Providence, RI, 1998, pp. 243-246.
[24] A. R. Cukras, Dirk Faulhammer, Richard J. Lipton, and Laura F. Landweber. “Chess games: A model for RNA-based computation”. In Proceedings of the 4th DIMACS Meeting on DNA Based Computers, held at the University of Pennsylvania, June 16-19, 1998, pp. 27-37.
[25] W.-L. Chang and M. Guo. "Using Sticker for Solving the Dominating-set Problem in the Adleman-Lipton Model". IEICE Transactions on Information and Systems, 2003, accepted.
[26] J. H. Reif, T. H. LaBean and N. C. Seeman. "Challenges and Applications for Self-Assembled DNA Nanostructures". Proceedings of the 6th DIMACS Workshop on DNA Based Computers, Leiden, Holland, June 2000.
[27] T. H. LaBean, E. Winfree and J. H. Reif. "Experimental Progress in Computation by Self-Assembly of DNA Tilings". Theoretical Computer Science, Volume 54, pp. 123-140, 2000.
[28] Cook, S. A. “The Complexity of Theorem-proving Procedures”. Proc. 3rd Ann. ACM Sym. On Theory of Computing, ACM, New York, pp. 151-158, 1971.
[29] Karp, R. M. “Reducibility among combinatorial Problems”. Complexity of Computer Computation, Plenum Press, New York, pp. 85-103, 1972.
[30] C. Mao, T. H. LaBean, J. H. Reif and N. C. Seeman. "Logical computation using algorithmic self-assembly of DNA triple-crossover molecules". Nature, 407, pp. 493-496, 2000.
[31] M. H. Garzon and R. J. Deaton. "Biomolecular Computing and Programming". IEEE Transactions on Evolutionary Computation, Volume 3, pp. 236-250, 1999.
[32] W.-L. Chang, M. Guo and M. Ho. "Solving the Set-splitting Problem in the Sticker-based Model and the Adleman-Lipton Model". The 2003 International Symposium on Parallel and Distributed Processing and Applications, Aizu City, Japan, 2003, accepted.
[33] E. Bach, A.Condon, E. Glaser and C. Tanguay. "DNA models and algorithms for NP-complete problems". In Proceedings of the 11th Annual Conference on Structure in Complexity Theory, pp. 290-299, 1996.
HPM: A Hierarchical Model for Parallel Computations*
Xiangzhen Qiao & Shuqing Chen
The Institute of Computing Technology, Chinese Academy of Sciences Beijing, P R China 100080
Abstract: A hierarchical model for parallel computations is introduced and evaluated in this paper. The model describes general homogeneous parallel computer systems with Hierarchical Parallelism and hierarchical Memories (HPM). A parallel function HP describes the multi-level parallelism of the system. The memory function HM shows the characteristics of the hierarchical memories. The binding relation HB gives the connection between parallelism and the hierarchical memories. Some commercial parallel computers are illustrated according to the HPM model. The performance of parallel algorithms is analyzed from the parallelism and locality point of view.
Key Words: HPM, E-RAM, Binding relation, Vertical points, Crossing points.
1. Introduction
A fundamental challenge in parallel processing is to develop effective models for parallel computation. A number of models for parallel algorithm design and analysis have been proposed, refined and studied in the last 20 years [1]. Primary among them are PRAM [2], BSP [3] and LogP [4]. Several models are primarily concerned with the memory hierarchy, focusing on accounting for the various levels of the storage hierarchy [2].
The standard model, the Random Access Machine (RAM) [1], does not represent memory
hierarchies. It assumes unit-time access to all memory cells, regardless of large cost
differences between accesses to the cache, main memory, and disks. The differences are
unlikely to disappear in the future. For most algorithms, the RAM assumption is too
inaccurate, and some alternative models have been proposed, such as the uniform memory
hierarchy model (UMH) [5]. Memory-hierarchy models have been extended by parallelism,
and parallel models have been extended by a representation of memory hierarchy [5].
A hierarchical model for parallel computations is introduced in this paper. This model
describes the general homogeneous parallel computer systems with Hierarchical Parallelism
and hierarchical Memories (HPM). A parallel function HP is defined to describe the
* Supported by the National Natural Science Foundation of China (Grant 69933020).
multi-level parallelism of the system. The function HM is used to show the characteristics of
hierarchical memories. The topology of the system is not addressed, but a binding relation HB
is defined to describe the connection of parallelism and hierarchical memories. Four
parameters are used to analyze the performance of the HPM model: p (number of processors);
s (peak performance of a processor in the HPM); k and k+
(decreasing ratio of the
performance of the memory hierarchy to the cpu). Some commercial parallel computers are
illustrated according to the HPM model. The parallel HPM algorithm is defined in this paper.
A parallel HPM algorithm consists of a set of p independent sub-algorithms, each running on
a processor of an HPM system. These sub-algorithms interact through some interactive
points. The sub-algorithms are different from the general RAM algorithm, and we call them
Enhanced RAM (E-RAM) algorithms: a traditional RAM algorithm with the
consideration of the parallelism and memory hierarchies inside a cpu. The performance of
parallel algorithms is analyzed from the parallelism and locality point of view. The
performance of an HPM algorithm depends on the E-RAM algorithm, the parallelism behavior
of the algorithm, and the influence of the computing (parallel) interactive points and data
interactive points. The real performance of the algorithm will depend on both sides of the
architecture and algorithm characteristics.
Compared with other memory hierarchy models, the most significant characteristic of
the HPM model is that it is computing-centric (parallelism-centric). Parallelism is the
everlasting factor influencing the performance of parallel algorithms. The second is that the
connection of parallelism and the locality is given in a relation HB. The influence of memory
hierarchies on the computation can easily be analyzed from HB. Third, the “synchronism” is
not defined in this model. In fact, it is an asynchronous model. The synchronism is hidden in
the data communication between the processors. Fourth, the parallel systems are described
from the “parallelism” and the “locality”. If only the “parallelism” is concerned, the HPM
model is equivalent to a PRAM model. Most of the present parallel algorithms are designed in
the PRAM model. Thus many HPM algorithms can be transferred from PRAM algorithms. In
this paper, we mainly discuss the model and its performance analysis based on scientific
computing.
The remainder of the paper is organized as follows. The HPM model is defined in Section 2, where some examples are given. Parallel algorithms based on the HPM model are defined in Section 3. The performance of parallel algorithms is analyzed in Section 4. The paper ends with the conclusion.
2. The HPM model
The HPM model, denoted as HPM(hp, hm, hB), describes general homogeneous parallel computer systems with Hierarchical Parallelism and hierarchical Memories. It defines a parallel computer system in three respects.
2.1 Hierarchical parallelism
A parallel function HP describes the multi-level parallelism of an HPM system:

    HP = (p0, p1, …, php)                                (1)

Here hp is the number of levels of parallelism in the HPM system. p0 is the parallelism inside a cpu, including the characteristics of pipelining, multiple function units, chaining, and so on. In fact, p0 is equal to the number of floating point operations that can be performed by a processor in one clock cycle. On the i-th (1 ≤ i ≤ hp) level of parallelism there are multiple homogeneous sub-systems. The "macro parallelism" of a sub-system on the i-th level of the HPM system is given by pi (1 ≤ i ≤ hp). In general, on the i-th (2 ≤ i < hp) level, there are ps1 (ps1 = php · php−1 · … · pi) sub-systems, each with parallelism ps2 (ps2 = p1 · p2 · … · pi−1). On the hp-th level of parallelism, the parallel system has php sub-systems, each with parallelism ps2 (ps2 = p1 · … · php−1). When i = 1, ps1 = p (p = p1 · p2 · … · php) and ps2 = 1. For simplicity and without loss of generality, here we only discuss homogeneous systems.

If the peak performance of a processor in the HPM(hp, hm, hB) system is s, then the peak performance Shpm of the above HPM will be Shpm = s · p1 · p2 · … · php = s · p.
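The bookkeeping of this subsection can be stated directly in code. The sketch below (illustrative names and a hypothetical 3-level configuration) computes ps1 and ps2 for a level i, and the peak performance Shpm = s · p.

```python
from math import prod

# p = [p1, ..., p_hp] lists the parallelism of each level; i counts from 1.
def hpm_level(p, i):
    ps1 = prod(p[i - 1:])  # number of sub-systems on the i-th level
    ps2 = prod(p[:i - 1])  # parallelism of each such sub-system
    return ps1, ps2

def hpm_peak(s, p):
    # S_hpm = s * p1 * p2 * ... * p_hp = s * p
    return s * prod(p)

# Hypothetical 3-level system: 4-way cpus, 8 cpus per node, 16 nodes.
ps1, ps2 = hpm_level([4, 8, 16], 2)
```

Note that for i = 1 the formulas reproduce the text's special case ps1 = p and ps2 = 1.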
2.2 Hierarchical memories and binding relation
The memory system of the HPM model consists of multi-level hierarchical memories. The sequence of memory modules MUk (1 ≤ k ≤ hm) is a memory hierarchy; there are in total hm levels of hierarchical memories. Module MUk can hold multiple blocks, and a "bus" BUSk connects MUk to the next module MUk+1. Blocks are further partitioned into sub-blocks. The sub-block is the unit of transfer along the bus from the previous module; we call the sub-block a basic block in this paper. All buses can be active concurrently. The only
4
restriction is that no data item in the block being transferred on BUSi is available to buses
BUSk-1 or BUSk+1 until the transfer of the entire block is completed. Here MU1 can be the
caches, MU2 main memory, MU3 hard disk …. Three important characteristics are required for
the messages stored in MUk: coherency, inclusion, and locality [7]. Generally speaking, the
transformation of the messages between the adjacent levels of the memory hierarchy is
consecutive. The nearer the cpu, the higher transfer speed, and smaller capacity of the
memory module. Thus, the hit rate and the reuse possibility are important for the memory
system [7].
A binding relation HB describes the connection between parallelism and locality:
HB = hl1, hl2, …, hlhp        (2)
Here hli (1 ≤ i ≤ hp) describes the data-transfer style between the pi sub-systems on the i-th parallel level. In general, hli can take two kinds of values: a positive integer k (1 ≤ k ≤ hm) means that information exchange between the pi sub-systems is through data accesses to a common shared memory (based on MUk). hli can also take the value k+, which means that data transfer between the pi sub-systems is through message passing in a network connecting them. The network is based on the k-th level of the memory hierarchy, MUk, of each of the pi sub-systems.
2.3 The performance of the memory hierarchy
In this paper we take the general (L, B) analysis for memory and communication timing. Thus, the performance of the memory hierarchy of the HPM model can be expressed by:
fj = (Uj, lj, Bj, Lj, gj(qj))        (3)
or
fj+ = (lj+, Bj+, Lj+, gj+(qj))        (4)
In equation (3), 1 ≤ j ≤ hm. Uj is the capacity of MUj, lj is the length of the basic block for hierarchy MUj, Bj is the bandwidth of data transfer between MUj and MUj+1, and Lj is the latency (including the hardware and software latency). gj(qj) is the penalty of memory-access conflicts; if the memory is private to a processor, gj(qj) = 0, otherwise gj(qj) ≥ 0.
In equation (4), hm < j ≤ hm + hc, where hc is the number of hlj taking a value j+ (1 ≤ j ≤ hm). lj+ is the length of the basic block between the MUj of different sub-systems whose data communication is based on memory hierarchy j, Bj+ is the bandwidth of the data transfer, Lj+ is the latency, and gj+(qj) is the penalty of communication conflicts.
In general, the time for transferring data of length nl between memory hierarchies MUj and MUj+1 can be expressed by the following equation:
tcj = (⌈nl / lj⌉ * lj) / Bj + Lj        (5)
Here 1 ≤ j ≤ hm. For the same data, the time for message passing on the j-th level (hm < j ≤ hm + hc) will be:
tcj+ = (⌈nl / lj+⌉ * lj+) / Bj+ + Lj+        (6)
Generally, we have Uj-1 < Uj, lj-1 < lj, Bj-1 > Bj, and Lj-1 < Lj for 1 < j ≤ hm, as well as Bj > Bj+ and Lj < Lj+.
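Equations (5) and (6) are easy to mechanize. The sketch below (parameter values are invented for illustration) computes the transfer time, rounding the transfer up to whole basic blocks as the ⌈nl/lj⌉ term requires:

```python
import math

def transfer_time(nl, l_j, B_j, L_j):
    """Time to move nl words between hierarchy levels MU_j and MU_j+1
    (equation (5)); the same formula with l_j+, B_j+, L_j+ gives the
    message-passing time of equation (6).

    nl  -- number of words to transfer
    l_j -- basic-block length (words): transfers happen in whole blocks
    B_j -- bandwidth (words per second)
    L_j -- latency (seconds, hardware + software)
    """
    blocks = math.ceil(nl / l_j)          # whole basic blocks moved
    return (blocks * l_j) / B_j + L_j

# 1000 words in 16-word blocks at 1e8 words/s with 5 us latency:
# 63 blocks * 16 words = 1008 words are actually moved on the bus.
t = transfer_time(1000, 16, 1e8, 5e-6)
print(t)   # 1008/1e8 + 5e-6 = 1.508e-05 seconds
```

The rounding term is why short transfers are latency-dominated: halving nl below one block length does not reduce the time at all.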
2.4 Examples of HPM model
The HPM model can be used to describe general commercial computer systems. For
example:
A common sequential computer can be described as HPM(0, 3, hB), HP = p0, p0 ≥ 1. This system consists of a CPU and a memory system. The memory hierarchies can be caches, main memory, and hard disk. When the computer has parallelism inside the CPU, that is, multiple floating-point operations can be completed in one clock cycle, p0 > 1. For example, for a PowerPC 604e computer, we have p0 = 2. HB is not defined for a sequential computer.
A symmetric shared-memory multiprocessor (SMP) system can be defined in the HPM model as HPM(1, 3, hB), HP = p0, p1. Here p0 has the same meaning as above, and p1 is the number of CPUs in the SMP. The memory hierarchies of an SMP are MUi (1 ≤ i ≤ 3), corresponding to cache, main memory, and hard disk respectively. HB = 2, which means that information transfer between the processors is through the second level of the memory hierarchy, the common shared memory. For an IBM Power II-270 SMP multiprocessor [8], p0 = 2 and p1 = 2. For a Power3-II SMP multiprocessor [8], p0 = 4 and p1 = 4.
Distributed-memory parallel computers and clusters of computers can be described in the HPM model as HPM(1, 3, hB), HP = p0, p1. Here p0 has the same meaning as above, and the number of processors in the system is given by p1. The binding relation is HB = 2+. This means that information transfer between the processors is completed through the network, and the network is based on the 2nd memory hierarchy, the private main memory of every processor. For example, for the Dawning D1000A [8], p0 = 2 and p1 = 8; for the Dawning 2000-1, p0 = 2 and p1 = 32.
Clusters of SMPs (abbreviated as CoSMPs in this paper) using fast networks, such as Myricom's Myrinet [9], have emerged as important platforms for high performance computing [10]. A CoSMP has two levels of parallelism. The hierarchical memories consist of caches, main memory, and hard disk. Thus, its HPM model can be denoted as HPM(2, 3, hB), and HP = p0, p1, p2, where p0 is the parallelism inside a CPU, p1 is the parallelism of an SMP sub-system, and p2 is the parallelism on the cluster level, that is, the number of SMP sub-systems in the whole system. Here we call an SMP sub-system a "node". The total parallelism of the system is p = p1 * p2. The binding relation of a CoSMP is expressed as HB = 2, 2+. Information transfer between the sub-systems of the first level of parallelism (inside an SMP) is through the second level of the memory hierarchy: each processor in an SMP exchanges information with other processors by writing (or reading) the information to (or from) the common shared memory. Information transfer between the sub-systems of the second level of the system (on the cluster level, between the SMP nodes) is completed through the network, and the network is based on the 2nd memory hierarchy, the private main memory in every node. For example, the Dawning D3000 [8] is a CoSMP system. It consists of multiple workstation nodes, each with four IBM Power3-II microprocessors. Every processor has two function units, and a floating-point multiply-add operation can be completed in one clock cycle by each function unit. The interconnection network, Myrinet, is used for high-speed communication. Thus, we have p0 = 4, p1 = 4, and p2 = nsizes, where nsizes is the number of nodes in the system. HP = 4, 4, nsizes, and the total parallelism of the system is p = 4 * nsizes.
2.5 Performance parameters of the HPM model
According to the above analyses and paper [12], we can use the decreasing ratios γk and γk+ to describe the performance decrease across the memory hierarchy in an HPM model. If the peak performance of a processor in an HPM is denoted s, the memory access speed of memory hierarchy k is denoted Vk, and the performance of memory hierarchy k+ is denoted Vk+, then we have the following definition.
Definition 1 The decreasing ratio γk is defined as the ratio of the performance of memory hierarchy k to the peak performance of a central processor of the HPM: γk = Vk / s. The decreasing ratio γk+ is defined as the ratio of the performance of memory hierarchy k+ to the peak performance of the central processor of the HPM: γk+ = Vk+ / s.
Obviously, 0 < γk ≤ 1 and 0 < γk+ ≤ 1, with γk+ ≤ γk, γk+1 ≤ γk, and γ(k+1)+ ≤ γk+.
The performance parameters of the HPM model can be summarized as follows:
p: number of processors;
s: peak performance of a processor in the HPM;
γk: decreasing ratio of the performance of memory hierarchy k relative to the CPU;
γk+: decreasing ratio of the performance of memory hierarchy k+ relative to the CPU.
The real performance of an HPM algorithm can be analyzed by the parameters (p, s, γk, γk+) and the characteristics of the HPM algorithm.
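A small sketch of the decreasing ratios (the access-speed values below are invented for illustration, not measurements from the paper), including a check of the ordering constraints 0 < γk ≤ 1 and γk+1 ≤ γk stated above:

```python
def decreasing_ratios(s, level_speeds):
    """gamma_k = V_k / s for each memory hierarchy level k.

    s            -- peak performance of a processor
    level_speeds -- access speeds V_1 >= V_2 >= ... (cache, memory, disk...)
    """
    gammas = [v / s for v in level_speeds]
    assert all(0 < g <= 1 for g in gammas), "each ratio must lie in (0, 1]"
    assert all(a >= b for a, b in zip(gammas, gammas[1:])), \
        "ratios must not increase moving away from the CPU"
    return gammas

# Hypothetical machine: cache keeps pace with the CPU, main memory runs
# at 10% of peak, disk at 0.1% of peak.
print(decreasing_ratios(1e9, [1e9, 1e8, 1e6]))   # [1.0, 0.1, 0.001]
```
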
3. HPM algorithms
A parallel algorithm on an HPM(hp, hm, hB) model consists of p independent sub-algorithms, each running on a processor of the HPM system and "communicating" by some rules. Each sub-algorithm is a RAM algorithm with extra considerations; we call such a sub-algorithm an E-RAM algorithm.
3.1 E-RAM algorithms
Enhanced RAM (E-RAM) algorithm: a traditional RAM algorithm extended with consideration of memory hierarchies and parallelism inside a CPU.
E-RAM algorithms differ from traditional RAM algorithms. For example, on the RAM model, the three computing methods for matrix multiplication (inner, middle, and outer product) are computationally equivalent, but on the E-RAM model the three methods have totally different computational behavior [6], because they have totally different memory behavior.
3.2 Parallel HPM algorithms
A parallel algorithm on an HPM model consists of p independent sub-algorithms, each running on a processor of the HPM system and "communicating" by some rules. Before we discuss HPM algorithms, we need some related definitions.
Computing (parallel) interactive point
A computing (parallel) interactive point is a starting, ending, or synchronization point in a parallel computation.
Data interactive point
A data interactive point is a point of data access or data communication in a computation.
Data interactive points can be further distinguished as vertical interactive points or crossing interactive points.
Vertical interactive point
A vertical interactive point is a point of data access to a memory hierarchy with level k. Examples are general memory accesses, such as cache, main memory, and disk accesses.
Crossing interactive point
A crossing interactive point is a point of data access to a memory hierarchy with level k+. One example is a communication point through the network (memory hierarchy 2+) on a cluster or CoSMP system. We sometimes call a crossing interactive point a communication interactive point, and a vertical interactive point a memory interactive point.
We call all the above points interactive points. Thus, a parallel HPM algorithm can be defined as follows.
Parallel HPM algorithm
A parallel HPM algorithm consists of a set of p independent E-RAM sub-algorithms, each running on a processor of an HPM computer. These sub-algorithms interact through interactive points (including computing (parallel) interactive points and data interactive points).
3.3 Examples of HPM algorithm
A sequential algorithm is an E-RAM algorithm.
An SMP algorithm is a set of p independent E-RAM sub-algorithms, each running on a processor of an HPM computer of the SMP type. These sub-algorithms interact through computing (parallel) interactive points (such as the "FORK" and "JOIN" primitives on an SMP system) and vertical interactive points (for memory accesses).
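The FORK/JOIN structure of such an SMP algorithm can be sketched with ordinary threads (an illustrative stand-in, not the paper's own primitives): each thread runs its E-RAM sub-algorithm on a slice of the data, and a shared result array plays the role of the common shared memory through which the sub-algorithms interact.

```python
import threading

def smp_sum(data, p):
    """Sum `data` with p threads in FORK/JOIN style.

    Each thread is an E-RAM sub-algorithm working on its own slice;
    `partial` is the shared memory touched at the vertical interactive
    points (each thread writes only its own cell, so no lock is needed).
    """
    partial = [0] * p                       # shared memory between threads
    chunk = (len(data) + p - 1) // p

    def worker(i):
        partial[i] = sum(data[i * chunk:(i + 1) * chunk])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:                       # FORK: computing interactive point
        t.start()
    for t in threads:                       # JOIN: computing interactive point
        t.join()
    return sum(partial)                     # combine via shared memory

print(smp_sum(list(range(100)), p=4))       # 4950
```
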
A cluster algorithm is a set of p independent E-RAM sub-algorithms, each running on a processor of an HPM computer of the cluster type. These sub-algorithms interact through computing (parallel) interactive points (such as the "mpi_init" and "mpi_finalize" primitives on a cluster) and crossing interactive points.
A CoSMP algorithm is a set of p independent E-RAM sub-algorithms, each running on a processor of an HPM computer of the CoSMP type. These sub-algorithms interact through computing (parallel) interactive points (both cluster and SMP primitives), vertical interactive points (for memory accesses), and crossing interactive points (for communication).
4. Performance analyses of HPM algorithms
For an HPM algorithm, programming the computing (parallel) interactive points and the crossing interactive points requires special parallel primitives. Even though the programming syntax for the E-RAM algorithm and the vertical interactive points is the same as for a traditional sequential algorithm, the performance may differ. The performance of an HPM algorithm depends on the E-RAM algorithm, the parallelism behavior of the algorithm, and the influence of the computing (parallel) interactive points and data interactive points. The real performance of the algorithm depends on both the architecture and the algorithm characteristics.
4.1 The performance of E-RAM algorithms
According to the above definition, a sequential algorithm on an HPM system is an E-RAM algorithm. It is obvious that:
Lemma 1 If the time of a RAM algorithm for solving problem A on a computer with p0 = 1 is TRAM, then the time TE-RAM of the E-RAM algorithm for solving the same problem A on a computer with parallelism p0 satisfies TRAM / p0 < TE-RAM < TRAM.
If only the computing complexity is considered, the ideal execution time of an E-RAM algorithm for algorithm A (with computing complexity W) on a processor of the HPM with sequential peak performance s will be T1(t) = W / s.
Taking into account the ability of algorithm A to utilize the parallelism inside a CPU (represented by p0) and the influence of the data interactive points, the real execution time can be expressed by the following equation:
T1(r) = T1(t) + T1(P0) + T1(m) = T1(t) * (1 + η0(P0) + η0(m)),
where η0(P0) = T1(P0) / T1(t) and η0(m) = T1(m) / T1(t). Here η0(P0) ≥ 0, η0(m) > 0, and T1(r) > T1(t). η0(P0) is the influence factor of the parallelism inside a single CPU and of the algorithm's ability to utilize that parallelism; if the parallelism is well utilized, η0(P0) approaches 0. η0(m) is the influence factor of memory accesses and of the algorithm's memory behavior; a smaller value of η0(m) means higher memory-access efficiency.
Thus, the real performance of a sequential HPM algorithm can be expressed as S(r) = W / T1(r) = s / (1 + η0(P0) + η0(m)) = ε0 * s, where ε0 = T1(t) / T1(r) = 1 / (1 + η0(P0) + η0(m)); ε0 depends both on the optimization characteristics of the algorithm (memory accesses) and on the architecture.
Here we discuss the influence of the memory interactive points. For an algorithm A with computing complexity W, we use the following definitions:
Space complexity [11] M0: the necessary memory space for solving algorithm A.
Memory access complexity M1: the amount of real memory accesses for solving algorithm A on a real computer.
β1: the ratio of the computing complexity to the space complexity of A: β1 = W / M0.
α: the ratio of the space complexity to the memory access complexity of A: α = M0 / M1.
β2: the ratio of the computing complexity to the memory access complexity of A: β2 = W / M1.
δ: the ratio of the cache capacity Cc to the space complexity of A: δ = Cc / M0. This value is related to the cache hit rate; a large value of δ means the possibility of a higher cache reuse rate.
If the size of an algorithm is n, then the values of W, M0 and M1 are all functions of n, and so are β1 and β2. W and M0 (and hence β1) are intrinsic attributes of an algorithm, independent of its implementation on a particular computer. However, M1 depends not only on the characteristics of the algorithm, but also on the implementation style of the algorithm on a particular computer.
In general, we have α ≤ 1. If α equals 1, the algorithm cannot be improved from the memory-access point of view; however, this does not mean that the algorithm has high memory-access efficiency. In general, β1 = W / M0 = O(n^i), i > 0. In most cases, the bigger i is, the bigger the reuse rate of algorithm A and the higher the memory-access efficiency. Usually, β2 is used for comparing the memory-access efficiency of different implementations of an algorithm. When α is near 1 and β1 is a higher-order function of n, the algorithm can have high memory-access efficiency.
For the multiplication of two n × n matrices, W = 2n³ and M0 = 3n². The "inner product" (denoted (i,j,k)), "outer product" (denoted (k,i,j)), "middle product" (denoted (j,k,i)) and "blocking" methods are different implementations. If these methods are programmed in FORTRAN on a sequential computer, the values of M1, α and β2 are as listed in Table 1.

Table 1. The performance of matrix computing
Algorithm   M1                   α                     β2
(i,j,k)     nl·n² + (nl+1)·n³    3 / (nl + (nl+1)·n)   2n / (n + (n+1)·nl)
(k,i,j)     nl·n² + 2·nl·n³      3 / (nl + 2·nl·n)     2n / ((2n+1)·nl)
(j,k,i)     nl·n² + 2·n³         3 / (nl + 2·n)        2n / (2n + nl)
blocking    3·n0·n²              1 / n0                2n / (3·n0)

Here nl ≥ 1 is the length of the cache line (in words), and n = n0 * nb. In the blocking matrix multiplication, the matrices A, B and C are divided into sub-matrices, each of size nb × nb, so there are n0 × n0 sub-matrices in each matrix. When a block can be kept entirely in the caches, the values improve further.
For matrix multiplication, β1 = W / M0 = O(n), so it should have good memory-access efficiency. However, the values of α for the different implementation methods (inner product, outer product, middle product and blocking) differ. The value of α for the blocking method is the nearest to 1, which means that the blocking method has the best memory-access performance among the four implementations. This was tested on one processor of a Power3-II; the practical results agree with our principles here. The testing results are given in paper [6].
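The implementations compared in Table 1 differ only in loop order and blocking, yet touch memory very differently. A minimal sketch of the (i,j,k) and blocked variants (plain Python lists, for illustration only; a real comparison would use FORTRAN or C and time the variants):

```python
def matmul_ijk(A, B, n):
    """Inner-product (i,j,k) order: C[i][j] accumulates a dot product."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_blocked(A, B, n, nb):
    """Blocked order: each nb x nb sub-matrix is reused while it is hot
    in cache, which is why its alpha ratio in Table 1 is closest to 1."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, nb):
        for j0 in range(0, n, nb):
            for k0 in range(0, n, nb):
                for i in range(i0, min(i0 + nb, n)):
                    for j in range(j0, min(j0 + nb, n)):
                        for k in range(k0, min(k0 + nb, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

n = 4
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
B = [[float(j * n + i) for j in range(n)] for i in range(n)]
# Both orders compute the same product; only the memory traffic differs.
print(matmul_ijk(A, B, n) == matmul_blocked(A, B, n, nb=2))   # True
```
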
4.2 Some assumptions
For simplicity and without loss of generality, we make the following assumptions for parallel HPM algorithms.
If the parallel granularity is rather large, the time of the computing interactive points can be neglected compared with the computing time of the algorithm. Thus, an HPM algorithm can be seen as a set of p independent E-RAM sub-algorithms, each running on a processor of an HPM computer, interacting only through some data interactive points. If the time of the i-th E-RAM sub-algorithm is Ti (1 ≤ i ≤ p), then Ti = TERAMi + Tipi, where TERAMi is the computing time on a processor of the HPM system and Tipi is the time for the data interactive points. Thus, the time of this HPM algorithm will be Tp = (1/p) * Σ(i=1..p) Ti, or Tp = max(1≤i≤p) Ti. We will use the latter in this paper.
Thus, an HPM algorithm can be considered as p independent E-RAM sub-algorithms plus some data interactive points. The performance of an HPM parallel algorithm depends on two kinds of factors: the parallelism of the algorithm with the overhead of parallelization, and the overhead of the data interactive points. According to this principle, the performance of HPM algorithms is discussed in the next sub-section.
4.3 The performance of HPM algorithms
Based on the above assumptions, the performance of an HPM parallel algorithm mainly depends on two kinds of factors: the parallelism of the algorithm, and the overhead of the data interactive points, i.e., the vertical (memory) interactive points and the crossing interactive points.
In the theoretically ideal case, the computing time for an algorithm on the HPM will be Tp(t) = W / (s * p), and the theoretical speedup will be Sp(t) = p. For a real HPM algorithm, the computing time is influenced by the overhead caused by algorithm parallelization, vertical (memory) interactive points, and crossing (communication) interactive points. Thus the real computing time can be expressed as:
Tp(r) = Tp(t) + Tp(c) + Tp(k) + Tp(k+) = Tp(t) * (1 + ηp(c) + ηp(k) + ηp(k+)).
Here Tp(c) is the overhead caused by algorithm parallelization, Tp(k) is the overhead of the vertical interactive points, and Tp(k+) is the overhead of the crossing interactive points, with ηp(c) = Tp(c) / Tp(t), ηp(k) = Tp(k) / Tp(t), and ηp(k+) = Tp(k+) / Tp(t).
4.3.1 The algorithm parallelism and computing consistency
If only the computing complexity is considered, the performance of a parallel computation depends on two factors: the parallelism of the algorithm, and the complexity redundancy of the parallel algorithm A(P) compared with its sequential algorithm A. Thus, the time of the parallel computation is Tp(c) = Tp(p) + Tpc(o), where Tp(p) depends on the algorithm parallelism and the system parallelism, and Tpc(o) depends on the overhead of the computation and the complexity redundancy. Thus we have Tp(c) = Tp(t) * (ηp(P) + ηpc(o)) = Tp(t) * ηp, where ηp is the influence factor of parallel computing, ηp = ηp(P) + ηpc(o), ηp(P) = Tp(p) / Tp(t), and ηpc(o) = Tpc(o) / Tp(t).
In general, ηp(P) ≥ 0; the better the parallelism of the algorithm, the smaller the value of ηp(P). ηpc(o) is related to the computing redundancy of the algorithm. In most cases, ηpc(o) ≥ 0. However, in some special cases, ηpc(o) < 0; this may happen for certain searching algorithms, and it is one of the reasons for "super-linear" speedup. We use the term consistency to describe the computing redundancy of a parallel algorithm.
Consistency: consistency (Cpc) is the ratio of the computing complexity of a parallel algorithm (denoted Wp) to the computing complexity of its sequential partner (denoted W1).
If the value of Cpc is near 1 (more generally, if it is a constant), we say the parallel algorithm is consistent with its sequential partner.
If only the computing complexity is considered, the computing efficiency of a parallel HPM algorithm can be given as:
εP(c) = Tp(t) / (Tp(t) + Tp(c)) = 1 / (1 + ηp(P) + ηpc(o)) = 1 / (1 + ηp).
It is related only to the parallelism and the complexity redundancy of the parallel algorithm.
4.3.2 The influence of vertical (memory) interactive points
With the advent of parallel computation, the behavior of memory accesses may differ from the sequential case. The memory-access performance is mainly affected by the following factors:
1) The ratio δ of the total cache capacity to the necessary memory space of the algorithm (δ = Cc / M0) may increase because of the use of multiple processors. This leads to the possibility of a higher cache reuse rate, and thus higher memory-access performance.
2) For equation (5), tcj = (⌈nl/lj⌉ * lj) / Bj + Lj: as the bandwidth is shared and the latency accumulates over multiple processors, memory-access performance may be lower than in the sequential case.
From the above analyses, we have Tp(k) = Tp(t) * ηp(k). In most cases, ηp(k) ≥ 0. However, sometimes we may have ηp(k) < 0, for example in the special case where δ = Cc / M0 increases greatly. This is also one of the reasons for "super-linear" speedup. The parallel LU algorithm is a good example. LU decomposition for solving a set of linear equations has very good parallelism on an SMP computer. When the size of the matrix to be solved is very large, the overhead is very small, and more of the matrix data can be kept in the caches of all the CPUs of the SMP, so "super-linear" speedup may occur. We tested this on an SGI Power Challenge computer with parallelism 4; the results are given in Table 2.
Table 2. The "super-linear" speedup of the LU algorithm
n       T1      T2      Sp
256     0.18    0.077   2.34
512     1.37    0.483   2.83
768     5.40    1.660   3.25
1024    24.40   4.668   5.23

Here n is the dimension of the matrix, T1 is the time of the LU algorithm on one processor of the computer, T2 is the time of the LU algorithm on the parallel computer, and Sp = T1 / T2 is the parallel speedup. It can be seen that when n is very large, we may get "super-linear" speedup.
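The speedups in Table 2 follow directly from Sp = T1 / T2. A quick check (using the timings reported in the table; the computed ratios agree with the table up to rounding) also flags which rows exceed the machine parallelism of 4, i.e. are super-linear:

```python
def speedups(rows, p):
    """rows: (n, T1, T2) triples; returns (n, Sp, super_linear) per row."""
    out = []
    for n, t1, t2 in rows:
        sp = t1 / t2
        out.append((n, round(sp, 2), sp > p))   # super-linear if Sp > p
    return out

table2 = [(256, 0.18, 0.077), (512, 1.37, 0.483),
          (768, 5.40, 1.660), (1024, 24.40, 4.668)]
for row in speedups(table2, p=4):
    print(row)
# Only the n = 1024 run (Sp = 5.23) exceeds the parallelism 4.
```
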
4.3.3 The influence of the crossing interactive points
The effect of crossing interactive points can be expressed as:
Tp(k+) = Tp(t) * ηp(k+).
In fact, the value of ηp(k+) is related to the ratio of the overhead of the crossing points to the parallel computing complexity of the algorithm. It depends both on the characteristics of the parallel algorithm and on the implementation strategy of the algorithm on an HPM computer. The implementation strategy mainly depends on the structure of the memory hierarchy at level k+ and on the data partition strategy for the algorithm on the HPM computer. Obviously, when we solve a parallel algorithm, the characteristics of the parallel algorithm and the structure of the memory hierarchy at level k+ on the HPM computer are relatively fixed; the partition strategy is therefore critical to the performance of the algorithm.
To analyze the influence of communication interactive points, for a section of an algorithm or a complete computation (abbreviated as a section), for example a loop body or an independent section of a program, we use the following ideas.
Definition modified array
A modified array is the left-hand-side array of an assignment statement in the loop body.
Definition relevant array
The right-hand-side array B of a statement in the loop body is called a relevant array of the modified array A. If an array C is implicitly related to array B, it is called an implicit relevant array of A. If A also appears on the right-hand side, it is called self-relevant.
Definition relevant elements
The elements of a relevant array B that are related to the modification of an element A(i1, i2, …, ik) of a modified array A are called the relevant elements of A(i1, i2, …, ik). The total number of relevant elements is called the relevant number of A(i1, i2, …, ik).
Definition relevant distance
If an m-dimensional array A is a self-relevant array, then the relevant distance of an element A(i1, i2, …, im) (1 ≤ ij ≤ nj) in the j-th direction is defined by d(j)(i1, …, im) = |S − ij|, where A(i1, i2, …, S, …, im) is a relevant element of A(i1, i2, …, ij, …, im) in the j-th direction.
Theorem 1
If array A is not included in its relevant or implicit relevant arrays, then there exists a partition of array A such that there is no communication for array A when the loop body is parallelized.
However, we need to point out that the parallelization penalty is the extra memory space required for the computation. Thus, for computations with large memory requirements, this method is not practical.
Theorem 2
If the implicit relevant distance of the array in the i-th direction is equal to 0, then a partition of array A along the i-th direction may lead to parallelism without communication.
We omit the proofs here. The most general parallelization method is data partitioning, especially for arrays. For iterative computations, not only the communication behavior, but also the number of iterations and the amount of computing in every step, depend on the partition style. When communication cannot be avoided, the following principles should be used for higher communication performance.
A good partition of array A should satisfy the following conditions: 1) the ratio of computing complexity to communication is large; 2) the total amount of communication is small; 3) the total number of communication operations is small.
For simplicity and without loss of generality, we only discuss parallel sections here. The influence of crossing interactive points depends on the array partition strategy. It can be analyzed in the several cases below.
CASE 1. No data communication, but multiple copies of data may be needed.
This case happens when all levels of the loops are parallelizable and the relevant distance is zero. For example, matrix multiplication can be parallelized at any level of the loops, and the distance is zero. If the memory space is big enough, we can parallelize the algorithm along direction i of arrays A and C and copy array B to every computer; no communication is needed in this computation.
CASE 2. Data communication on the boundary.
If data communication is needed only in one direction, and some other directions can also be parallelized, then communication may be avoided: we only need to parallelize another direction and partition the array into strips. If data communication is needed in every direction, we can partition the array into strips or squares.
CASE 3. Data communication in different directions.
The computation can be divided into several (for example, two) sections in which the data dependencies, and hence the communication, occur in different directions. If we partition the arrays and parallelize the computation in a fixed direction, the communication cost may be very high; 2-D FFT and ADI computations are such examples. One possible choice for this case is to parallelize the computation and partition the arrays with a different strategy per section, using the necessary data redistribution between the sections of the computation.
CASE 4. Higher data reuse rate.
When the data reuse rate in a computation is high, the blocking method can be used for high memory performance at every hierarchy level. Linpack and blocked matrix multiplication are good examples.
CASE 5. Macro-pipeline parallel strategy.
In this case, the vertical-horizontal partition can be used for higher performance [6].
We discuss these related topics and examples in another paper [6].
4.3.4 The efficiency of parallel HPM algorithms
If all the above factors are considered, the time of a parallel computation is:
Tp(r) = Tp(t) + Tp(c) + Tp(k) + Tp(k+) = Tp(t) * (1 + ηp(c) + ηp(k) + ηp(k+)).
The real efficiency of a parallel HPM algorithm can be given as:
εP(r) = Tp(t) / Tp(r) = 1 / (1 + ηp(c) + ηp(k) + ηp(k+)).
It depends on the parallelism of the algorithm, the overhead of parallelization, the overhead of computing-complexity redundancy, and the overhead of the memory and crossing interactive points.
5 Conclusion
The parallel computing model HPM is introduced and evaluated in this paper. The model describes general homogeneous parallel computer systems with hierarchical parallelism and hierarchical memories. Some commercial parallel computers are described in terms of the HPM model. For a general HPM computer, parallel algorithms are defined and their performance is analyzed. From this discussion, we can see that the model is well suited to analyzing scientific computing algorithms.
References:
[1] Claudia Leopold, Parallel and Distributed Computing, John Wiley & Sons, 2001.
[2] C. Xavier et al., Introduction to Parallel Algorithms, John Wiley & Sons, 1998.
[3] L. G. Valiant, A bridging model for parallel computation, Communications of the ACM, 1990, 33(8): 103-111.
[4] D. Culler et al., LogP: Towards a realistic model of parallel computation, in Proc. of the ACM SIGPLAN Symp. on Principles and Practices of Parallel Programming, 1993: 1-12.
[5] B. Alpern et al., Modeling parallel computations as memory hierarchies, in Proc. Programming Models for Massively Parallel Computers, Sept. 1993.
[6] Xiangzhen Qiao et al., Algorithm optimization on the HPM model. Submitted.
[7] D. E. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, San Francisco: Morgan Kaufmann, 1999.
[8] NCIC, Cluster Dawn3000, NCIC, Beijing, 2002.
[9] N. J. Boden et al., Myrinet: A Gigabit-per-Second Local-Area Network, IEEE Micro, Vol. 15, February 1995, pp. 29-38.
[10] Gordon Bell and Jim Gray, High Performance Computing: Crays, Clusters, and Centers. What Next?, August 2001, MSR-TR-2001-76.
[11] A. V. Aho et al., The Design and Analysis of Computer Algorithms, Addison-Wesley, 1976.
[12] Xiangzhen Qiao et al., "Double-stream" analysis for the performance of parallel computations, Computer Science, Vol. 28(10), Oct. 2001, pp. 7-12.
A Visual Environment for Specifying Global Reduction Operations
Robert R. Roxas and Nikolay N. Mirenkov
Graduate School, Department of Information Systems
University of Aizu, Aizuwakamatsu City, Fukushima 965-8580, Japan
email: d8032102, [email protected]
Abstract
Parallel computing is currently held as the most efficient way of making programs to solve scientific application problems, because scientific application problems usually involve the manipulation of a very huge volume of data. This makes sequential computing not a very useful approach. However, making parallel programs is not an easy task for programmers. They need a good tool or environment in which they can easily specify what they want to solve. Because of the complexity of making parallel programs, visual programming is a very promising approach in this area. In fact, there have already been several attempts to provide a visual environment for this purpose, but they achieved only a little success. One of the probable reasons is the visual representation of the programming constructs being employed. The commonly used approach is to create or draw some graphs, which are then annotated by attaching some programming code in the conventional text-based style.

Because of the important role that parallel programming plays in scientific computing, it is necessary to provide a simpler and more effective way of specifying computational activities in a parallel program. In order to achieve this ideal goal, we take this into account in the new visual programming environment that we are developing. Our new system allows the creation of programs from algorithmic "film" specifications, where the use of text is minimal. In this system, a language of films and a language of micro-icons are used, depending on the nature of the operation to be specified. This paper discusses how to specify global reduction operations using a language of micro-icons and shows how the visual programming environment supports the manipulation of these micro-icons.
Keywords: global reduction, master/slave, parallel computation, visual environment, algorithmic film
1. Introduction
Parallel computing is currently held as the most efficient way of making programs that solve scientific application problems, because such problems usually involve the manipulation of very large volumes of data, which makes sequential computing an impractical approach. Much research has been done to devise good parallel algorithms and thereby obtain efficient parallel programs for particular domains. However, it cannot be denied that, even at present, writing parallel programs is still not an easy task for programmers. They need a good tool or environment in which they can easily specify what they want to solve. Because of the complexity of writing parallel programs, visual programming is a very promising approach in this area.
Visual programming approaches come in different forms, all of which aim to provide a language that is easy to understand and an environment where making programs (parallel or sequential) is made easy. Research in this area started more than 10 years ago, but with limited success, and its progress is seemingly slow [21]. There has been research on visualization for parallelizing programs [8], visualizing parallel and/or concurrent programs [39, 34], visualizing parallel software execution [39], animating or visualizing parallel algorithms [23, 39], visualizing the memory access behavior of parallel programs [40], parallel simulations [4], etc. The visual approach is not limited to parallel programming but applies to sequential programming as well. In fact, visual approaches are also being considered in different types of applications, such as object-oriented databases and programming [6, 14], SQL databases [5], XML queries [10], multimedia applications [12, 19], rule-based programming [28], etc.
Despite the efforts that have been exerted in visualizing the different aspects of parallel computation, it is evident that visual programming, especially for making parallel programs, has not fully matured yet. One probable reason is the visual representation of the programming constructs employed. The common approach is to visualize computation using graphs [20, 2, 27]. In making programs, the programmer usually creates or draws graphs, which are then annotated by attaching programming code in the conventional text-based style. This may be the reason for using the term graphical programming [21, 2, 25] instead of visual programming. In the graphical approach, the nodes represent code that performs some computation and the arcs represent the flow of data and/or computation. In some existing languages, the arcs are called wires [21, 42]. The use of graphs is useful and easy to understand in some situations, especially for small programs, but when a program gets large, difficulty arises because of the limitation imposed by the size of the computer screen.
Because of the important role that parallel programming plays in scientific computing, a simpler and more effective way of specifying the computational activities of a parallel program is needed. We take this into consideration in the new visual programming environment that we are developing. This new system allows the creation of programs from algorithmic “film” specifications [22, 41, 44, 45, 9], where the use of text is minimal. Although this system also employs visual programming, it does not use graphs as the primary means of specifying and viewing the different parts of a program. Algorithmic films of a multiple-view format are used, in which the different parts can be specified and/or viewed in a non-linear order according to the user's demands. In this new programming technique based on such films, a programmer can browse, watch, edit, and specify or attach operations to film frames inside some panels. When creating a program, he simply combines films from a database suitable to his needs and updates the operations within the chosen films. The source code is generated automatically by the system using templates from a database [9]. In generating the source code, the C language (for non-MPI-related commands) and the C version of MPI (for MPI-related commands) are used. In this visual environment, six subsystems are used to implement the six different groups of frames related to the different views that the system offers to a programmer. The goal of these different views is to allow a programmer to understand his programs better from different points of view. In the past, the idea of using animation in making programs was restrictive and reduced to straightforward and often boring visualization of some intermediate results of computation; see [39].
One of the six subsystems in this visual environment is a visual or graphical interface for specifying I/O operations for software components and communication between components. The specification of communication among processes in a parallel program is just a partial case of the latter. Thus, we do not show or discuss the main visual environment here; interested readers can find information about it in [22, 44]. What we present here are the specification panels related to specifying global reduction operations within this visual environment. Because these panels are displayed and manipulated within the main visual environment, we consider them the visual environment for specifying global reduction operations. In general, its objective is to provide a tool with which programmers do not have to memorize a lot of things in order to use the system. This is one of the factors why we deem visualization an important programming tool. Some may say that real programmers do not really need to memorize programming commands, like MPI global reduction operations, because they can easily consult the users' manual. This may be true for some people but not for others. As can be observed, a lot of computer users shifted from the DOS environment to the Windows environment, and there must be reasons for this phenomenon. In the DOS environment, one has to know a lot of commands, with their parameters and switches, to use the operating system effectively; these things are not necessary in the Windows environment. Even for the Unix system, many people have developed graphical environments to enhance the pure text-based system, and many applications running on Unix systems are graphical. If a visual or graphical environment is considered useful for computer users, then a visual programming environment should be very useful for programmers, especially since making parallel programs is a difficult task.
To illustrate the features of this subsystem, the specification of global reduction operations based on the master/slave scheme of computation is presented. Any of these operations is attached to a master/slave film using a language of micro-icons. Our approach is not limited to global reduction operations; other types of collective communication operations are also being considered and can be found in [30]. We are trying to provide micro-icons that are intuitive enough to be understood even by a novice programmer. The number of micro-icons can grow as new features are added to the system, because they form an open set. When there are too many micro-icons in the system, a programmer may need some aid in understanding them. Thus, a system help facility that comes in three forms, animation, sound, and text, is included in the system.
In this paper, no novel algorithms are introduced to improve the current state of problem-solving, especially for scientific computing. Nor does it present any experiments with existing systems. What is described here is just a part of a visual programming environment that aims at making programming, especially for scientific computing, a bit easier. What is presented is an approach that aims at providing visual support for the existing MPI library, and the examples used are limited to the specification of global reduction operations only. In Section 2, some works related to this research are described briefly. These works have influenced this research and formed the basis for a quest for a new visual programming paradigm and language. In Section 3, an overview of the visual programming environment called “Active Knowledge Studio” (AKS) [22, 44, 45, 41, 9] is given. Then the specification of the different global reduction operations is illustrated in Section 4. In Section 5, source code generation is discussed. Finally, this paper is concluded in Section 6 with statements about the future direction of this work.
2. Related works
Much research has been done to provide environments in which making (sequential or parallel) programs is made easy. Even at present, many researchers are still studying how to make programs from visual symbols in order to facilitate software construction. These people believe that such an approach will make the construction of programs easy. This should be considered an ideal goal, especially for parallel programming, because of the complexity of making parallel programs. The following paragraphs briefly describe some of the systems that use visualization as a technique to make parallel programs easy to create and their computational activities easy to understand. In general, the common technique they use is graphs, where nodes signify computations and arcs signify communication paths. One can consider this technique partial in the sense that the graphs are used only to visualize the flow of the computational process. In reality, it is just the outward appearance of a program that they visualize. The programmer still goes back to text-based programming, usually using the syntax (or a close variant) of the programming language used to run the program.
HeNCE [2] is a graphical parallel programming environment intended to simplify the construction of parallel programs. In fact, the authors claim that it is a tool that greatly simplifies the writing of parallel programs. When making parallel programs, the programmer draws directed graphs in which the nodes represent calls to conventional sequential subroutines written in C or Fortran and the arcs represent execution ordering constraints. The program is then automatically translated into an executable parallel program that runs under the PVM [11] system. The graphical aspect of the system is quite easy to use, but the creation of the actual code to be run by a node is quite difficult. This is because the programmer has to use a conventional programming language again, like C or Fortran, and the bulk of the work is in specifying the program textually. Thus, one can consider this system quite difficult to use and possibly prone to syntactical and/or typographical errors. It would therefore be ideal if the conventional programming style could be done away with, or at least minimized if it cannot be eliminated.
Another graphical parallel programming environment is the CODE [27, 25] system. The purpose of this system is to provide an environment in which users can create parallel programs by drawing and then annotating a picture. So programming is reduced to drawing pictures and then declaratively annotating them. Depending on the kind of information needed to annotate a node, annotating it can be either simple or not so simple. Setting the different attributes of a node is done by filling in a form with some text. This is simple, but to specify the code of a node, code whose syntax resembles that of the C language must be input from the keyboard with the help of a text editor. Although its syntax is like C, there are additional symbols and their corresponding semantics to memorize. Certainly, this is one step higher than a pure text-based programming style.
Paralex [1] is described as a complete programming environment that makes extensive use of graphics to define, edit, execute, and debug parallel scientific applications. It is a visual environment designed for making parallel programs in distributed systems, and it aims to explore the extent to which the parallel application programmer can be liberated from the complexities of distributed systems. A Paralex program is composed of nodes and links, where nodes correspond to computation (functions, procedures, programs) and links indicate the flow of (typed) data. Although one of the components of Paralex is a graphics editor for program development, the programmer must still supply the code of a particular node of interest. A text editor is still used to create or modify the source file or the include file of a given node, so the programmer has to know the syntax of a particular language like C, C++, Fortran, Pascal, Modula, or Lisp. Thus, it is just a step higher than a pure text-based programming style, which may also be prone to typographical and/or syntactical errors and may require memorization.
Visper [38, 37] is a graph-based software visualization tool intended for making parallel message-passing programs. It uses a formalism that combines control-flow and data-flow graphs into a single visual formalism, called the Process Control Graph (PCG) [36]. Visper has a user-friendly PCG visual editor for creating programs. When creating a program, a programmer draws graphs that show parallel programs or modules. These graphs are then annotated by filling in a set of forms, e.g., an ExecuteForm for execute blocks. Execute blocks are considered the basic building blocks for assembling the code into a program. They are used to add code to a PCG and to describe how sequential components are composed into a parallel construct. The created graphs can be compiled to automatically compose the program. If there are no syntax errors, a text file of the program is saved, ready for compilation with C or Fortran. This is quite close to what is presented in this paper in the sense that it tries to minimize typing or programming with text, as in the case of filling in a NodeForm. However, in some cases, like defining sequential computation to be executed by the (all or exclusive) execute blocks, the code is entered as text, just like writing modules in a conventional programming language. Thus, the authors themselves state that programming in PCG is not visual in all aspects [36].
Another visual parallel programming environment for message-passing parallel computing is VPE [26]. It is intended to provide a simple human interface for creating message-passing programs. When making programs, the programmer specifies the parallel structure of his programs by drawing pictures. VPE computations are represented as graphs where nodes represent processes and arcs represent the paths along which messages flow from process to process. Each graph contains computation (comp) nodes, which are annotated with program text, expressed in C or Fortran, that contains simple message-passing calls. Although this system provides a visual environment for the creation of parallel programs, it is still not visual in all aspects. The programmer still needs to write code in a text-based approach and has to memorize some syntax to be able to use the language. Thus, it is still quite difficult to use and may be prone to typographical and/or syntactical errors as well.
GRAPNEL [15] is a visual programming language designed for creating distributed parallel programs based on the message-passing programming paradigm. It is considered a hybrid language that uses both graphical and textual representations to describe the whole distributed application. Graphics are used to show those program parts which are important with respect to parallelism. Textual descriptions, written in C or Fortran, are used to specify program elements such as data declarations and definitions, code segments without any communication, etc. The technique is quite good, but there are still many things to memorize: not only the different graphical symbols but also the textual syntax associated with or defined for a particular graphic symbol. Thus, it is only a partially user-oriented system.
Of course, there are other systems that also provide a visual environment for creating parallel programs, but they are normally graph-based variants of those described above. Other related visual systems for making parallel programs include DIVA [29], Meander [43], and VADE [23], to name a few. Just like the ones mentioned above, these systems do not fully provide an environment where programming can be done without resorting to the text-based programming style, or where the use of text in constructing parallel programs is minimized. In fact, text-based programming plays a major part in the development of software in these systems.
Our visual programming environment uses algorithmic films instead of graphs as the major components of a program. An overview of this system is given in the next section. In general, the use of text is minimized and the use of micro-icons in programming is maximized. The system also aims to eliminate memorization in making programs by providing intuitive micro-icons. If the intuitiveness of the micro-icons is not enough, the system supplies simple-to-understand system help in the form of animation, sound, and text when necessary.
3. An overview of our system called Active Knowledge Studio
“Active Knowledge Studio” (AKS) [22, 44, 41, 45] is our on-going research in the area of “self-explanatory component” technology. In this system, six different subsystems are being considered in order to implement the six different views related to the different features of a software component. These features concern: 1) the computational schemes used (Multimedia & Skeleton group), 2) the variables and formulas used (Variables & Formulas group), 3) I/O operations and/or communication (Input/Output group), 4) the integrated view of the algorithm (Integrated View group), 5) auxiliary information (Registration group), and 6) links to other pieces of information (Links & History group). Appropriate panels will be provided for each of these groups.
In the AKS visual environment, a programmer creates a program simply by combining algorithmic films from a database. In general, he does not create a film from scratch. He can browse the films provided by the system and view the desired groups of features. If a film is appropriate for his needs, he can edit it by specifying its size, attaching variables to it, defining computational operations, specifying I/O operations and/or communication, etc. All these activities are done within the visual environment. He can combine several films to meet his computational needs. A program may therefore consist of one or more films, depending on its complexity. A film may consist of one or more scenes for achieving some tasks. A scene is a set of stills or frames performing related tasks. Each still is a single step of a computation and is the smallest item in a film where the programmer specifies or defines an operation.
A typical operation in a master/slave scheme of computation can be used to illustrate when to group stills into a scene. The slave process usually performs at least three things: it receives data from the master process, performs some computation, and sends the results of the computation back to the master process. These three activities can be considered as three scenes. Scene 1 is used for defining the operation of receiving data, scene 2 for specifying some computation, and scene 3 for specifying the send operation. Depending on the complexity of the operation in each scene and the algorithm used, a scene may consist of one or more stills. For instance, if a programmer wants to specify a matrix multiplication using the master/slave scheme, each slave process must first receive a row from the first matrix and the entire second matrix before doing any computation. Since these activities are all receive operations, they can be considered one scene with at least two stills, i.e., still 1 for receiving a row from the first matrix and still 2 for receiving the entire second matrix.
In the filmification approach, operations are defined in a still. There are many kinds of operations that may need to be specified in a still. These operations can be the specification of formulas for solving a fragment of the entire application, an input operation for initializing the data, an output operation after computation, communication with other processes in a parallel environment, etc. For example, under the Multimedia and Skeleton group, a programmer can specify certain types of node flashing in a still. He can also select only some nodes or all of the nodes of a still if necessary; see [22, 44, 45]. This paper presents some examples of attaching global reduction operations to a still of a film. These operations fall under the Input/Output group. A language of micro-icons is used to accomplish the task of attaching this kind of operation to a still. For specifying any of these global operations, the system provides a panel designed for that particular kind of global reduction operation. The specification panel contains buttons with example micro-icons that give the programmer an idea of the function behind the micro-icons. The programmer simply clicks on any of the buttons that he thinks needs to be changed.
One may wonder whether it is possible to specify computation without using text at all. So far, it is impossible, but the use of text can be minimized. Text is still used in this area, but its use is minimal, and a special kind of text, considered to be in “enhanced” format, is used. Details about this kind of text are found in [22, 44, 41].
4. Specifying global reduction type of operations
This paper presents how to specify the global reduction types of operation in the AKS visual environment. The master/slave scheme of computation is chosen as the model for illustrating the examples. In the examples, it is assumed that the programmer has already selected a master/slave film from the Multimedia and Skeleton group and wants to specify a global reduction operation. He can navigate among the scenes by clicking on the micro-icon for a particular scene of interest in the display of scenes. The corresponding still(s) of the chosen scene will be displayed inside the area for displaying stills. The programmer must switch to the appropriate still before specifying an operation; the operation defined in a still applies to the selected nodes only. The subsections below illustrate how to specify the different types of global reduction operations on a still of a film. The desired still should be selected first; otherwise, the programmer may specify an operation on the wrong still. The attached operation is always for the current still.
All specifications discussed in the subsections below are viewable in other modes, like animation and tile modes, similar to the examples discussed in [31]. The tile mode allows the programmer to view his set of specifications several frames at a time; this can be considered as viewing dynamics in a static way. Other similar panels are discussed in [32, 33], which were modified and included in [31]. However, all of them are used to visually specify the input/output operations of a software component, whereas the examples discussed below are used for visually specifying the global reduction operations of a parallel program using the message-passing paradigm.
4.1. Reduce operation specification
Figure 1 shows an example specification panel for defining a reduce operation. The panel is supplied with an example micro-icon for each button with a replaceable micro-icon. This makes the operation specification both easy to use and easy to understand. It is easy to use because the programmer has a visual guide at hand: the example micro-icons serve as visual guides. It is easy to understand because the example micro-icons are intuitive enough to convey the meaning of the operation they depict. To enhance the intuitiveness of a micro-icon, text is also used. The convention for this text is to use an italic font when it is part of a micro-icon but a normal font when it is used as a variable name. For example, in Figure 1, “BARRIER” is italicized because it is part of the micro-icon, while “SBUFF” is in a normal font because it is used as the name of a variable that contains some data. Moreover, the programmer can easily obtain system help if he is unsure or has some doubt about the meaning of a micro-icon. This system help is discussed later in this subsection. In general, the example micro-icons are more or less sufficient, so only simple fine-tuning is necessary.
The appearance of the panel generally varies according to the number of buttons necessary to specify the needed operation. But for the four global reduction operations discussed in this paper, all of the panels have the same appearance. For other types of collective communication, like scatter and gather, the panels are slightly different but use a similar approach [30]. Figure 1 contains three areas, the center of which is for specifying the main operations.
Figure 1. The reduce operation specification panel.
The first area (left) contains a node indicator signifying the type of nodes [44, 45] for which the operation is defined. As shown in the figure, the node indicator is a micro-icon made up of a set of simple squares. It means that the definition contained in the panel is a specification for an activity defined for a group of nodes. This is natural for any global reduction operation, because the operation is intended for a group of participating processes. The color of the node indicator depends on the color of the node(s) to which the given definition is attached. The programmer must select a node or group of nodes in a still of a film to which a particular operation will be attached. The chosen node(s) will be shown with a certain color, depending on the color chosen by the programmer. During the execution of the program, these chosen nodes will be flashed in some way depending on the activity of a given still. For a discussion of the different kinds of flashing nodes and their functions, see [9, 45]. For a definition attached to a single node, a different micro-icon is used; in that case, the micro-icon shows a single square rather than a set of squares. It should be noted that there are two shapes of nodes that the system uses: an ordinary node is represented as a circle, while a node that calls some functions or films is represented by a square. The system shows this node indicator automatically, based on the available information such as the type and color of the nodes.
The second area (center) of the specification panel is used to define the operation itself. This part is called the Main Definition Area (MDA). This area contains two non-replaceable micro-icons, i.e., the micro-icons with a white background, and two sets of buttons with replaceable micro-icons, i.e., the buttons with a gray background. The MDA of the example shown in Figure 1 contains two non-replaceable micro-icons and six buttons with replaceable micro-icons. The system automatically supplies the two non-replaceable buttons. The top one shows the type of operation being used. In this example, the micro-icon represents a global reduction operation. It portrays a configuration of a set of processes: as indicated by the arrow, the left part shows the initial configuration and the right part the final configuration after the execution of the operation. Processes are depicted as small boxes aligned vertically. Processes involved in a given operation are colored black, while those that do not participate are colored white. As can be seen in the example, the initial configuration shows all of the small boxes in black; these black boxes represent the processes that perform an operation resulting in the configuration shown on the right side of the micro-icon. The example micro-icon also shows that only one process is black after the operation. This means that only one of them is involved in the final step of the operation, i.e., receiving the result of the computation. From the small boxes on the left, lines are connected to a small white box at the center, representing the result of the operation that will be sent to the single black process on the right side of the micro-icon. Each line carries a small geometrical shape, distinct from the others, because in a global reduction operation different processes hold different, or heterogeneous, data. The heterogeneity here refers to the values, not to the type of the data.
Figure 2 shows the different micro-icons for the different types of collective communication using the message-passing paradigm, which of course include the global reduction operations. The four micro-icons at the top (first row), from left to right, are for the broadcast, scatter, gather, and alltoall types of communication. The four micro-icons in the middle (second row), from left to right, are for the allgather, reduce, allreduce, and reduce-scatter types of
Figure 2. Micro-icons for message-passing operations.
communication. The four micro-icons at the bottom (third row), from left to right, are for the exclusive-scan, inclusive-scan, left-shift-exchange, and right-shift-exchange types of communication.
The other non-replaceable micro-icon shows the type of structure being used. The system also sets this button automatically, because it already knows the type of structure for which the operation is being defined. The example in Figure 1 shows a master/slave scheme. The master process is represented as a gray box with connections to the different slaves, depicted as small black boxes. It also has connections to the data elements, which are shown at the upper part of the micro-icon. Of course, the presence of this micro-icon is not very important in the specification panel in edit mode, because the programmer can see the current structure in the display of stills. The edit mode view is just one of the views supported by the system; animation and tile views are supported as well, to help the programmer view his set of specifications several frames at a time. When using these other views, the programmer may not see the current structure under consideration. Thus, the presence of this micro-icon, signifying the structure to which the specified operation is attached, is important to help him easily recognize the structure being used, especially if there are many types of structures in a complex system. To keep what the programmer sees in edit mode consistent with the other modes, the structure is also included in the specification panel in edit mode. These other modes and their functions are discussed in [31].
The next discussion concerns the first four buttons with replaceable micro-icons. These micro-icons belong to the Source Definition (SD) group. The upper left button contains a micro-icon showing the name “SBUFF” (send buffer), whose structure is a 2-D array containing double-precision floating-point numbers. If the programmer wants to change the given example micro-icon, he can edit it, e.g., to another dimension, structure, or data type. It is assumed that such a data structure has already been defined prior to the specification of the operation. It should be pointed out that in MPI [24, 7] the buffers are usually 1-D arrays, but in real applications, other dimensions are also used and manipulated. To allow the use of multi-dimensional arrays and to comply with the requirements of MPI, any higher-dimensional data will be saved into a 1-D array before calling an MPI function. Another solution would be to declare a new data type, because MPI supports this. The upper right button of SD contains a micro-icon showing the type of operator used in this example operation. The micro-icon shows the commonly used symbol for a summation operation. Other types of operators, including user-defined ones, can be specified in this button, but the example is limited to the summation operator.
The next button, found at the lower left side of SD, contains the text label "INFO." It simplifies the specification panel for a message-passing operation. In MPI [24, 7] collective communication such as scatter, gather, alltoall, and allgather, the programmer has to type the count and the type of the data to be sent twice: once at the sender side and once at the receiver side. The rationale is to make sure that the amount of data sent equals the amount of data received, and that both sides use the same data type. Since these pieces of information are declared twice, it is better to show them just once in the specification panel and supply them automatically in the source code; in fact, in most cases they are significant only at the root process. Global reduction operations differ slightly from the collective communications mentioned above, but for the sake of uniformity the same technique is applied when specifying any global reduction operation. The button with the text label "INFO" names the structure that will hold these pieces of information, which include at least the type of the data and its size or count; these can be set by the system automatically. The structure also records whether a single process, called the root, or the members of a certain group are involved in the specified operation. In the case of a root process, it also stores the root's rank; in the case of a group, it also stores the identifier of the group communicator. Additional pieces of information, such as the ones discussed in [3, 16] for ensuring the causal order of messages, can be included depending on the implementation. The system can set most of these pieces of information automatically; combining them simplifies the design of the I/O specification panel in AKS.
The lower right button of SD specifies a barrier synchronization. The micro-icon shows an image of some synchronizing processes before (to the left of) a clock image symbolizing time. Since the processes are to the left of the clock image, a barrier synchronization must take place before the operation defined in the specification panel is executed. It can be changed to the kind of barrier synchronization where the clock image is to the left of the synchronizing processes; in that case, a call to a barrier synchronization takes place after the specification defined in the panel, e.g., the reduce operation, has been executed. The system is then instructed to synchronize all processes so that the slave processes are free and ready to perform further data manipulation or computation once the given specification has been executed.
Figure 3 shows the different barrier synchronization options that can be used. The top row (from left to right) consists of: a barrier that waits for a single process, a barrier that waits for some processes, a barrier that waits for all processes, and a case where the processes wait for some events. As depicted in the first three micro-icons from the left, the synchronized processes are found before (to the left of) the clock image. This means that a barrier synchronization must be called by the processes before proceeding to the operation specified in the specification panel. The first three micro-icons in the bottom row of Figure 3 are almost the same as the corresponding micro-icons in the top row; the difference lies only in the placement of the clock image. There, the synchronized processes are found after (to the right of) the clock image, which means that synchronization must take place after the operation specified in the specification panel has been executed. The rightmost micro-icon at the bottom of Figure 3 specifies no synchronization. A barrier micro-icon uses a vertical line to represent the synchronization point. The small white circle represents a waiting process; small black circles to the left of the synchronization line signify processes that have not yet reached the synchronization point, and small black circles to the right of the vertical line signify processes that have passed it. Other types of barrier can be used in different situations. One possibility is to wait for a particular process, symbolized by a single black circle to the left of the synchronization line. Another is a barrier that waits for only a certain number of processes, symbolized by more than one small black circle to the left of the synchronization line together with a black circle to its right.
Figure 3. Synchronization options.
The next discussion is about the Target Definition (TD) group of buttons in the specification panel. For a reduce operation, what needs to be specified is the receiving buffer and the process to which the data will be sent. The button at the top shows the name of the receiving buffer, in the example "RBUFF," together with its structure and data type. In MPI [24, 7], the receive buffer is required to be of the same size as the send buffer, and it is significant only at the root. In the example, the same structure is used in both the upper left button of SD (send) and the upper button of TD (receive); the data type and the size of the structure must likewise match those of the sender's buffer. The appearance of the 2-D structure in this micro-icon is only a visual depiction: the 2-D data is actually converted to a 1-D array behind the scenes. The button at the bottom of TD specifies the root process to which the resulting data will be sent after the operation. It contains the text label "ROOT" as the variable identifier naming the receiving process.
As shown in Figure 1, there is no explicit way to specify the type and amount of data to be received. These pieces of information are accessible from the data behind the definition of the button with the text label "INFO" found at the lower left part of the SD group. The size of the structure and the type of data it contains are automatically saved in the "INFO" structure. Other pieces of information included in this "INFO" structure are an indicator that the participants are members of a group and the identifier of the group communicator. This is because in a reduce operation all members of the group participate, and only the process with the rank of the root receives the result.
The third area (right) of the specification panel contains two functional buttons that are not grouped together. The top button contains the text label "Edit." When this button is clicked, the label changes to "Help"; hence it is called the "Edit/Help" button. When the "Edit" label is shown, the system is in the "Edit" mode. In this mode, clicking any button with a replaceable micro-icon in the SD or TD group shows options with which the programmer can change the example specification. When the "Help" label is shown, clicking any button in the MDA, including the non-replaceable micro-icons, shows the system help in the context of the clicked micro-icon. This is done so that the programmer can get information even for trivial things. The system help is handy and easy to understand and use; it comes in three forms: animation, sound, and text. The micro-icons themselves are intended to be intuitive, but if the programmer wants to make sure of a micro-icon's meaning or function, he can invoke the system help easily. The help text is rather short because it is limited to the meaning of the clicked micro-icon. This aims at making programming easy because no memorization is required, and it is one small part of the broader goal of providing self-explanatory components.
The button below the "Edit/Help" button is used to record the specification found in the panel. It is called the "Set/Reset" button because its label toggles between "Set" and "Reset" when clicked. When the system is in the "Set" mode, clicking this button means the programmer wants to save his specification, and the corresponding source code will be created automatically; the aim is to make programming easy without reverting to a text-based programming style. When the button is in the "Reset" mode, all buttons in the SD and TD groups with replaceable micro-icons are inactive and cannot be replaced, meaning the "Edit" mode has been disabled. The "Help" mode, of course, remains enabled, and it is still possible to view animations, hear sound, or read text.
Figure 4 shows the corresponding source code for the specification shown in Figure 1. It contains declarations of local variables; some code portions, such as the MPI initialization and finalization calls, are not shown. The 2-dimensional structure is saved to a 1-D structure by calling an extra function, Convert2Dto1D(). It is also possible to create a new MPI datatype instead, since MPI provides functions for that. After the data has been converted, it is passed to the MPI_Reduce() call. Since the specification in Figure 1 defines a barrier synchronization that must synchronize all processes before the reduce operation is performed, the call to the barrier synchronization is issued first. If Figure 1 had defined a barrier synchronization with the image of the synchronizing processes after (to the right of) the clock image, the barrier synchronization would be called after the reduce operation; if no synchronization were specified, there would be no call to a barrier synchronization at all. As for the parameters of the MPI_Reduce() call, the number of elements in the send buffer and their data type are retrieved from the structure named "INFO." The same holds for the "COMM" communicator, which indicates the group of processes participating in the desired operation. These pieces of information are not included in the specification panel shown in Figure 1, for the reason discussed above. The values of the "INFO" structure are set before the source data is converted from a 2-D array to a 1-D array; this location is indicated by the ellipsis before the call to Convert2Dto1D().
4.2. AllReduce operation specification
Another type of global reduction operation is the allreduce operation. It is basically the same as the reduce operation except that, instead of one process (the root) receiving the result, all member processes in the group receive the same result. In this case, there is no need to specify a root process. Figure 5 shows an example specification for the allreduce operation. Its appearance is more or less the same as that of Figure 1, except for a few micro-icons found in the panel.
The discussion here focuses on the Main Definition Area (MDA) only, because the other parts of the specification panel have the same functions as discussed above. The top non-replaceable micro-icon in Figure 5 differs from that of Figure 1 because a different type of global operation is being specified. As shown in Figure 5, the final configuration, represented as small boxes arranged vertically at the right side of the micro-icon, shows all processes colored black. Besides this indicator, lines connect the result, shown at the center of the icon, to each of these small black boxes. This means that all processes receive the result of the computation defined in the specification panel. The lower right button specifies a barrier synchronization. This time the barrier is called after the specification found in the panel has been executed, as indicated by the position of the image of the synchronizing processes: it is placed after (to the right of) the image of the clock. Compare this with the one specified in Figure 1. The next discussion focuses on the Target Definition (TD) group of the MDA. The button at the bottom specifies the group of processes that will receive the result of the allreduce operation: instead of the root process, a communicator is specified as the receiver of the result.

...
MPI_Comm COMM;
int X_dim, Y_dim;
int gsize, ROOT, myrank;
double *SBUFF, *RBUFF;
...
MPI_Comm_size(COMM, &gsize);
MPI_Comm_rank(COMM, &myrank);
SBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
RBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
...
convert2Dto1D(&sbuff, X_dim, Y_dim, &SBUFF);
MPI_Barrier(COMM);
MPI_Reduce(SBUFF, RBUFF, INFO->count, INFO->type, MPI_SUM, ROOT, INFO->COMM);
...

Figure 4. The reduce operation source code.

Figure 5. The AllReduce operation specification panel.
Figure 6 shows the corresponding source code for the specification shown in Figure 5. This source code is very similar to the one shown in Figure 4; in fact, there are only two noticeable differences. One is the placement of the call to MPI_Barrier(), which now follows the call to MPI_Allreduce(), because a different kind of barrier synchronization is specified in Figure 5. The other is the parameter list of the call to MPI_Allreduce(), which does not include "ROOT."
4.3. Reduce-scatter operation specification
The next global reduction operation to be discussed is the reduce-scatter operation. It is similar to the two global reduction operations discussed in the previous subsections; the difference lies in how it handles the result of the computation. Instead of sending the result to a single process (reduce) or the same result to all processes (allreduce), it divides the result and scatters the different parts to all member processes of a certain group. Figure 7 shows the specification panel for a reduce-scatter global operation. Again, the discussion concentrates on the Main Definition Area (MDA), because this is where the panel differs from the other panels shown so far.

...
MPI_Comm COMM;
int X_dim, Y_dim;
int gsize, ROOT, myrank;
double *SBUFF, *RBUFF;
...
MPI_Comm_size(COMM, &gsize);
MPI_Comm_rank(COMM, &myrank);
SBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
RBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
...
convert2Dto1D(&sbuff, X_dim, Y_dim, &SBUFF);
MPI_Allreduce(SBUFF, RBUFF, INFO->count, INFO->type, MPI_SUM, COMM);
MPI_Barrier(COMM);
...

Figure 6. The AllReduce operation source code.
Figure 7. The Reduce-Scatter operation specification panel.
The top non-replaceable micro-icon is the one automatically displayed once the reduce-scatter type of operation is chosen. This button is very similar to its counterpart shown in Figure 5. The first noticeable difference is the portrayal of the results of the computation: instead of a single small white box representing a single result, a set of small white boxes arranged vertically is used. Each box in this set represents the data that will be sent to one process, so there are as many small boxes as there are member processes in the group. Another difference is the small geometrical shapes attached to each line connecting the result to the receiving processes. Instead of white squares on all lines, a distinct shape with a distinct color is used for each line. This shows that the result of the computation is scattered rather than broadcast: in scattering, distinct data is sent to each member process of a group, while in broadcasting, the same data is sent to all member processes.
The next discussion is about the Source Definition (SD) group of the MDA. Comparing the specification panel in Figure 7 with those in Figures 1 and 5, one can observe just one obvious difference: the micro-icon in the lower right button of SD. In this example specification, no barrier synchronization is used, as indicated by the text "NONE" and the absence of the image of the synchronizing processes; only the image of a clock remains. This illustrates how to specify no barrier synchronization. It should be noted that this button initially holds an example barrier synchronization by default; the programmer can change the default micro-icon by clicking the button, whereupon the system offers the possible micro-icons shown in Figure 3 for him to choose from. In the Target Definition (TD) group, the top button contains the micro-icon for the receive buffer. As can be seen in the specification panel, it shows just a slice of a 2-D structure. In Figures 1 and 5 the receive buffers are both 2-D structures, but in Figure 7 it is a 1-D structure: since the result of this operation is split into n parts, the receive buffer is visualized as smaller than the corresponding send buffer.
Figure 8 shows the corresponding source code that will be generated by the system. It is very similar to the previous example source codes except for a few things. As can be observed, there is no call to MPI_Barrier(); this is expected because no synchronization barrier was specified in Figure 7. Another difference is the definition of the receive buffer "RBUFF," which is a 1-D structure instead of a 2-D structure.
...
MPI_Comm COMM;
int X_dim, Y_dim;
int gsize, ROOT, myrank;
double *SBUFF, *RBUFF;
...
MPI_Comm_size(COMM, &gsize);
MPI_Comm_rank(COMM, &myrank);
SBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
RBUFF = (double *) malloc(X_dim*sizeof(double));
...
convert2Dto1D(&sbuff, X_dim, Y_dim, &SBUFF);
MPI_Reduce_scatter(SBUFF, RBUFF, INFO->count, INFO->type, MPI_SUM, COMM);
...
Figure 8. The Reduce-Scatter operation source code.
4.4. Scan operation specification
The scan operation is a variant of the three global reduction operations discussed above. It is often used in parallel prefix computation, as mentioned in [17, 35, 13, 18]; in this paper it is also called the prefix sum operation. The specification panel for this operation is shown in Figure 9. Comparing it with Figure 5, only two buttons have different micro-icons: the top non-replaceable micro-icon and the lower right button of the Source Definition (SD) group.
The top non-replaceable micro-icon shows the type of operation being defined in the given specification panel. Among the four micro-icons used to represent the different types of global operation discussed and illustrated in this paper, the one used in Figure 9 is quite different: it depicts the accumulation of data. Viewing the micro-icon from top to bottom, where the topmost level represents the lowest-numbered process, one can see the accumulation. Process 0 stores in its receive buffer simply the data from its own send buffer. Process 1 combines its data with that of process 0 using the specified operator and stores the result in its receive buffer. Process 2 combines its data with the data from processes 0 and 1 and stores the result in its receive buffer. In general, process i combines its data with the data from processes 0 through i-1 using the specified operator.
Figure 10 shows the corresponding source code for the specification defined in Figure 9. It is very similar to the one shown in Figure 6 except for the last two lines, which differ in the placement of the call to MPI_Barrier() and, of course, in the type of global reduction call and its parameters.
Figure 9. The Inclusive Scan operation specification panel.
...
MPI_Comm COMM;
int X_dim, Y_dim;
int gsize, ROOT, myrank;
double *SBUFF, *RBUFF;
...
MPI_Comm_size(COMM, &gsize);
MPI_Comm_rank(COMM, &myrank);
SBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
RBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
...
convert2Dto1D(&sbuff, X_dim, Y_dim, &SBUFF);
MPI_Barrier(COMM);
MPI_Scan(SBUFF, RBUFF, INFO->count, INFO->type, MPI_SUM, COMM);
...
Figure 10. The Scan operation source code.
5. Source code generation
Source code is generated automatically by the system using templates from a database. As mentioned in Section 3, the system under consideration contains six subsystems implementing the six different groups of frames being considered. Each of the six subsystems has its own set of templates, and the code generation subsystem collects the source codes generated by these different subsystems to form a whole program. While collecting source codes from the different subsystems, the code generation subsystem performs some consistency checking; if no conflicts are found, it proceeds with the generation of an executable.
This section describes only the source code generation part of the subsystem for specifying I/O operations and communication among processes in a parallel program. After specifying a particular kind of operation by replacing some of the example micro-icons in the specification panel, the programmer can record his set of specifications using the "Set" button provided in every specification panel. During this operation, the system also performs some simple consistency checking to make sure there will be no conflicting definitions. If there are none, the subsystem proceeds to generate source code for the given specification from a template database. The generated source code is saved for the code generation subsystem, which combines it with the code generated from specifications made under the other subsystems to make a complete program. The complete program is then compiled to produce an executable of the entire program. The programmer does not have to go back to a text-based programming style and does not need to memorize the syntax and semantics of a language, as he would in a text-based programming environment. At present, the code generation subsystem is not yet fully operational, but its implementation is underway, so no full-fledged results are presented here. For the message-passing type of communication, the existing C version of the MPI [24, 7] library is used. For more information about the code generation subsystem of "Active Knowledge Studio," please see [9].
6. Conclusion and future work
A visual tool for specifying global reduction operations under the message-passing paradigm has been presented. The master/slave scheme of parallel computation was used as the model to illustrate the specification of these operations. No novel algorithms were introduced to improve the current state of problem-solving, especially for scientific computing, and no experiments with existing systems were presented. What was described is part of a visual programming environment aimed at making programming, especially for scientific computing, a bit easier: an approach that provides visual support for the existing MPI library. An overview of a different technique for making parallel programs using algorithmic "films" was also discussed. In this technique, a programmer can navigate the films either by scenes or by stills, and can manipulate the stills of a film inside a specification panel by attaching operations using a language of micro-icons. Every panel is supplied with intuitive example micro-icons that serve as guides to the programmer; he performs simple fine-tuning of these examples to specify any of the global reduction operations. Moreover, if the intuitive meaning of the micro-icons is not enough, the system provides simple-to-understand help in three forms: animation, sound, and text.
What we have presented is just a part of our on-going project called "Active Knowledge Studio." Much remains to be done. Once the entire system is fully operational, experiments will be conducted to evaluate its effectiveness as a tool, to determine whether it can be used to build real, complex scientific applications, and to validate whether programmers really find the environment easy to use and understand.
References
[1] O. Babaoglu, L. Alvisi, A. Amoroso, R. Davoli, and L. A. Giachini. Paralex: An environment for parallel programming in distributed systems. In Proceedings of the 6th International Conference on Supercomputing, pages 178–187. ACM Press, 1992.
[2] A. Beguelin, J. Dongarra, G. A. Geist, R. Manchek, K. Moore, P. Newton, and V. Sunderam. HeNCE: A Users' Guide Version 2.0, July 1994. Available online at: http://www.netlib.org/hence/HeNCE-2.0-doc.ps.gz.
[3] W. Cai, B. Lee, and J. Zhou. Causal order delivery in a multicast environment: An improved algorithm. Journal of Parallel and Distributed Computing, 62(1):111–131, Jan. 2002.
[4] C. D. Carothers, B. Topol, R. M. Fujimoto, J. T. Stasko, and V. Sunderam. Visualizing parallel simulations in network computing environments: A case study. In S. Andradottir, K. J. Healy, D. H. Withers, and B. L. Nelson, editors, Proceedings of the 1997 Winter Simulation Conference, pages 110–117, 1997.
[5] D. Chappell and J. H. Trimble, Jr. A Visual Introduction to SQL. John Wiley & Sons, Inc., second edition, 2002.
[6] I. F. Cruz. Doodle: A visual language for object-oriented databases. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, pages 71–80. ACM Press, June 1992.
[7] J. Dongarra, S. W. Otto, M. Snir, and D. Walker. A message passing standard for MPP and workstations. Communications of the ACM, 39(7):84–90, July 1996.
[8] C.-R. Dow, S.-K. Chang, and M. L. Soffa. A visualization system for parallelizing programs. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 194–203. IEEE Computer Society Press, Dec. 1992.
[9] T. Ebihara, R. Yoshioka, and N. N. Mirenkov. Program generation from film specifications. In Proceedings of the Sixth IASTED International Conference on Software Engineering and Applications, pages 403–410. Acta Press, Nov. 2002.
[10] M. Erwig. Xing: A visual XML query language. Journal of Visual Languages and Computing, 14(1):5–45, Feb. 2003.
[11] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine: A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.
[12] A. Guercio, T. Arndt, and S.-K. Chang. A visual editor for multimedia application development. In Proceedings of the 22nd International Conference on Distributed Computing Systems Workshops, pages 296–301. IEEE Computer Society, July 2002.
[13] W. D. Hillis and G. L. Steele, Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, Dec. 1986.
[14] C.-H. Hu and F.-J. Wang. Towards a practical visual object-oriented programming environment: Desirable functionalities and their implementation. Journal of Information Science and Engineering, 15(4):585–614, July 1999.
[15] P. Kacsuk, G. Dozsa, and T. Fadgyas. Designing parallel programs by the graphical language GRAPNEL. Microprocessing and Microprogramming, 41(8-9):625–643, Apr. 1996.
[16] A. D. Kshemkalyani and M. Singhal. Necessary and sufficient conditions on information for causal message ordering and their optimal implementation. Distributed Computing, 11(2):91–111, 1998.
[17] R. E. Ladner and M. J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831–838, Oct. 1980.
[18] Y.-C. Lin and C.-S. Yeh. Optimal parallel prefix on the postal model. Journal of Information Science and Engineering, 19(1):75–83, Jan. 2003.
[19] P. Maresca, T. Arndt, A. Guercio, and P. Donadio. An enhanced visual editor for multimedia software engineering. In Proceedings of the 8th International Conference on Distributed Multimedia Systems, pages 568–572. Knowledge Systems Institute, Sept. 2002.
[20] T. F. Masterson and R. M. Meyer. SIVIL: A true visual programming language for students. The Journal of Computing in Small Colleges, 16(4):74–86, May 2001.
[21] A. L. McKinney. A recent radical graphical approach to programming. The Journal of Computing in Small Colleges, 18(26):28–35, June 2003.
[22] N. Mirenkov, A. Vazhenin, R. Yoshioka, T. Ebihara, T. Hirotomi, and T. Mirenkova. Self-explanatory components: A new programming paradigm. International Journal of Software Engineering and Knowledge Engineering, 11(1):5–36, Feb. 2001.
[23] Y. Moses, Z. Polunsky, A. Tal, and L. Ulitsky. Algorithm visualization for distributed environments. In IEEE Symposium on Information Visualization, pages 71–78. IEEE Computer Society, Oct. 1998.
[24] MPI. MPI: A Message-Passing Interface Standard, June 1995. Available online at: http://www.mpi-forum.org/docs/mpi-11.ps.
[25] P. Newton and J. C. Browne. The CODE 2.0 graphical programming language. In Proceedings of the 6th International Conference on Supercomputing, pages 167–177. ACM Press, 1992.
[26] P. Newton and J. Dongarra. Overview of VPE: A visual environment for message-passing parallel programming. Technical report, University of Tennessee at Knoxville, Nov. 1994. Available online at: http://www.cs.utk.edu/~library/TechReports/1994/ut-cs-94-261.ps.Z.
[27] P. Newton and S. Y. Khedekar. CODE 2.0 User and Reference Manual, Mar. 1993. Available online at: http://www.cs.utexas.edu/users/code/download/CODE_2.0_Manual.ps.
[28] J. J. Pfeiffer, Jr., R. L. Vinyard, Jr., and B. Margolis. A common framework for input, processing, and output in a rule-based visual language. In Proceedings of the 2000 IEEE International Symposium on Visual Languages, pages 217–224. IEEE Computer Society Press, 2000.
[29] H. Ranriamparany, B. Ibrahim, and H. Liesenberg. Parallel application specification with the DIVA system. In ACIS First International Conference on Software Engineering Applied to Networking, and Parallel/Distributed Computing, pages 35–42, May 2000. Available online at: ftp://cui.unige.ch/VISUAL PROG/articles/snpd00.pdf.gz.
[30] R. R. Roxas and N. M. Mirenkov. A visual approach to specifying message-passing operations. In Proceedings of the International Conference on Parallel Processing Workshops. IEEE Computer Society, Oct. 2003. To appear.
[31] R. R. Roxas and N. M. Mirenkov. Visual input/output specification in different modes. In Proceedings of the International Conference on Advanced Information Networking and Applications, pages 185–188. IEEE Computer Society, Mar. 2003.
[32] R. R. Roxas and N. N. Mirenkov. Visualizing external inter-component interfaces. In Proceedings of the 22nd International Conference on Distributed Computing Systems Workshops, pages 290–295. IEEE Computer Society, July 2002.
[33] R. R. Roxas and N. N. Mirenkov. Visualizing input/output specification. In Proceedings of the 8th International Conference on Distributed Multimedia Systems, pages 660–663. Knowledge Systems Institute, Sept. 2002.
[34] S. R. Sarukkai and D. Gannon. Parallel program visualization using SIEVE.1. In Proceedings of the 6th International Conference on Supercomputing, pages 157–166. ACM Press, May 1992.
[35] D. B. Skillicorn and D. Talia. Models and languages for parallel computation. ACM Computing Surveys, 30(2):123–169, June 1998.
[36] N. Stankovic, D. Kranzlmuller, and K. Zhang. The PCG: An empirical study. Journal of Visual Languages and Computing, 12(2):203–216, Apr. 2001.
[37] N. Stankovic and K. Zhang. Visual programming for message-passing systems. International Journal of Software Engineering and Knowledge Engineering, 9(4):397–423, 1999.
[38] N. Stankovic and K. Zhang. A distributed parallel programming framework. IEEE Transactions on Software Engineering, 28(5):478–493, May 2002.
[39] J. Stasko, J. Dominique, M. H. Brown, and B. A. Price, editors. Software Visualization: Programming as a Multimedia Experience. The MIT Press, 1998.
[40] J. Tao, W. Karl, and M. Schulz. Memory access behavior analysis of NUMA-based shared memory programs. Scientific Programming, 10(1):45–53, 2002.
[41] A. P. Vazhenin, N. N. Mirenkov, and D. A. Vazhenin. Multimedia representation of matrix computations and data. Information Sciences, 141(1-2):97–122, Mar. 2002.
[42] L. Weaver and L. Robertson. Java Studio by Example. The Sun Microsystems Press, 1998.
[43] G. Wirtz. Modularization and process replication in a visual parallel programming language. In IEEE Symposium on Visual Languages, pages 72–79, Oct. 1994.
[44] R. Yoshioka and N. Mirenkov. Visual computing within environment of self-explanatory components. Soft Computing, 7(1):20–32, 2002.
[45] R. Yoshioka, N. Mirenkov, Y. Tsucida, and Y. Watanobe. Visual notation of film language system. In Proceedings of the 8th International Conference on Distributed Multimedia Systems, pages 648–655. Knowledge Systems Institute, Sept. 2002.
Session 5: System Architectures (2)
VLaTTe: A Java Just-in-Time Compiler for VLIW
with Fast Scheduling and Register Allocation∗

Suhyun Kim, Soo-Mook Moon                Kemal Ebcioglu, Erik Altman
[email protected]                        [email protected]
School of Electrical Engineering         IBM T. J. Watson Research Center
Seoul National University, Korea         Yorktown Heights, NY 10598
Abstract
For network computing on desktop machines, fast execution of Java bytecode programs is essential because these machines are expected to run substantial application programs written in Java. We believe higher Java performance can be achieved by exploiting instruction-level parallelism (ILP) in the context of Java JIT compilation. This paper introduces VLaTTe, a Java JIT compiler for VLIW machines that performs efficient scheduling while doing fast register allocation. It is an extended version of our previous JIT compiler for RISC machines called LaTTe [1], whose translation overhead is low (i.e., consistently taking one or two seconds for the SPECjvm98 benchmarks) due to its fast register allocation. VLaTTe adds the scheduling capability onto the same framework of register allocation, with a constraint for precise in-order exception handling which guarantees the same Java exception behavior as the original bytecode program. Our preliminary experimental results on the SPECjvm98 benchmarks show that VLaTTe achieves a geometric mean useful IPC of 1.7 (2-ALU), 2.1 (4-ALU), and 2.3 (8-ALU), while the scheduling/allocation overhead is 3.6 times longer than LaTTe's on average, which appears to be reasonable.

Keywords: Java virtual machine, Just-in-Time compilation, VLIW, scheduling, register allocation, out-of-order exception handling
∗This work has been supported by IBM T. J. Watson Research Center under a sponsored research agreement.
1 Introduction
Java has rapidly emerged as a commercially successful programming language that is being used
for everything from embedded systems to enterprise servers as well as for applets [2]. One of its
success factors comes from the platform independence achieved by using the Java virtual machine
(JVM), a program that executes Java’s compiled executable called bytecode. The bytecode is
a stack-based instruction set which can be executed by interpretation on all platforms without
porting the original Java source code. Since this software-based execution is obviously much
slower than hardware execution, performance issues have constantly been raised for Java.
We believe that higher Java performance can be achieved by instruction-level parallelism
(ILP) compiler and hardware technology. In particular, we are interested in exploiting ILP in the
context of Java just-in-time (JIT) compilation, targeting a VLIW machine. The JIT compiler
is a component of the Java Virtual Machine (JVM) which translates Java’s bytecode into native
machine code immediately prior to its execution so that the Java program runs as a real executable.
Since the machine code is “cached” in the JVM, it can be used again without re-translation. We
want to exploit ILP in Java programs by translating the bytecode into VLIW code.
In order to translate the bytecode into VLIW code, a straightforward approach is to first translate the bytecode into pseudo RISC code with symbolic registers, and then schedule and register-allocate the RISC code into VLIW code. The problem is how to generate high-performance VLIW code efficiently. Since the JIT compilation time is part of the whole running time, scheduling and register
allocation should be done quickly, preferably in a single integrated phase, i.e., software pipelining
followed by graph-coloring register allocation with coalescing would be the last techniques that
we want to use. This requires a trade-off between the quality and speed of scheduling and register
allocation, which poses a challenging engineering and research problem.
This paper introduces VLaTTe, a Java JIT compiler for VLIW machines that performs
fast scheduling with efficient register allocation. It extends our previous JIT
compiler for RISC machines with fast register allocation, LaTTe [1], with a
scheduling capability. The bytecode is first translated into pseudo RISC code in which many copies
corresponding to pushes and pops between local variables and the stack are generated. VLaTTe
performs global scheduling while performing register allocation that coalesces most of these copies
with a local lookahead. In fact, VLaTTe integrates scheduling and register allocation seamlessly in
a unified framework. Since the LaTTe JVM handles many of the Java runtime exceptions using traps
[1], VLaTTe schedules instructions with a constraint for precise out-of-order exception handling
in order to guarantee the same Java exception behavior as the original bytecode.
The rest of the paper is organized as follows. Section 2 briefly reviews the translation of the
bytecode into pseudo RISC code. Section 3 describes the fast scheduling and register allocation
technique of VLaTTe, which transforms the pseudo RISC code into VLIW code. Section 4 presents
experimental results to evaluate V LaTTe. A summary follows in Section 5.
2 Translation of Bytecode into Pseudo RISC Code
When a Java method is called for the first time, VLaTTe translates its bytecode into VLIW
code. The translation process is composed of five stages. In the first stage, all control join points
and subroutines in the bytecode are identified via a depth-first traversal. In the second stage,
the bytecode is translated into a control flow graph (CFG) of pseudo RISC instructions with
symbolic registers and all final/static/private methods are inlined. In the third stage, VLaTTe
performs scheduling and fast register allocation, generating VLIW code. Finally, the VLIW code
is converted into simulation code and is executed to obtain outputs and statistics. In this section,
we briefly review the translation of bytecode into pseudo RISC code.
2.1 Java Virtual Machine and Bytecode
The Java VM is a typed stack machine [3]. Each thread of execution has its own Java stack where
a new activation record is pushed when a method is invoked and is popped when it returns. An
activation record includes state information, local variables, and the operand stack. All compu-
tations are performed on the operand stack and temporary results are saved in local variables,
so there are many pushes and pops between the local variables and the operand stack. All data
manipulated in the JVM are typed, either by primitive types or by reference types of Java objects.
The bytecode instruction set includes instructions for stack manipulation, local variable access,
array access, and object access. It also includes arithmetic, logical, and conversion instructions
which take source operands off the stack and push the computation result back on the stack. For
flow control, there are instructions for method invocation, subroutines, branches, and returns.
Most instructions specify the type of data to manipulate in their opcodes.
2.2 Translation of Bytecode into Pseudo RISC Instructions
We briefly review the basic idea used when translating bytecode instructions into RISC primi-
tives with symbolic registers (we used SPARC primitives). The translation rule for each bytecode
instruction is determined based solely on the type and the opcode of the instruction itself. When
the code fragments generated for independently translated bytecode instructions are concate-
nated, the result is correct, since consistent formats are used for symbolic registers, especially for
those registers corresponding to the stack elements which are formed based on a translation-time
variable, TOP. A symbolic register in the pseudo RISC code is composed of three parts:
• The first character indicates the type: a = address (object reference), i = integer, f = float,
l = long, and d = double
• The second character indicates the location: s = operand stack, l = local variable,
t = temporaries generated for JIT translation purposes
• The remaining number further distinguishes the symbolic register
For example, al0 represents a local variable indexed by 0 whose type is an object reference. is2
represents the second item of the operand stack whose type is an integer.
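As an illustration, the three-part naming scheme can be decoded mechanically. The helper below is hypothetical (it only mirrors the convention described above, not VLaTTe's internal representation):

```python
# Decode a symbolic register name per the convention above:
# type character, location character, then a distinguishing index.
TYPES = {'a': 'address', 'i': 'integer', 'f': 'float', 'l': 'long', 'd': 'double'}
LOCS = {'s': 'operand stack', 'l': 'local variable', 't': 'temporary'}

def parse_symbolic(name):
    """Split a symbolic register name into (type, location, index)."""
    ty, loc, idx = name[0], name[1], int(name[2:])
    assert ty in TYPES and loc in LOCS, f'bad register name: {name}'
    return (TYPES[ty], LOCS[loc], idx)

print(parse_symbolic('al0'))   # ('address', 'local variable', 0)
print(parse_symbolic('is2'))   # ('integer', 'operand stack', 2)
```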
TOP is a translation-time variable used by (V)LaTTe (not a value computed at run-time) which
indicates the number of items on the operand stack just before translating the current bytecode
instruction. For example, if the current value of TOP is 4, “add isTOP-1, isTOP, isTOP-1”
means “add is3, is4, is3”. There is another translation-time array, type[1..TOP] which indicates
the type of each item (one of a,i,f,l,d) currently on the stack (required for translating dup/pop).
The bytecode of a method is traversed in depth-first order, starting at the beginning of the
method with TOP set to zero. Following any path of the bytecode, when a bytecode instruction
that pushes some item(s) on the stack is encountered, TOP is incremented by the number of
pushed items. Similarly, when a bytecode instruction that pops some item(s) is encountered, TOP
is decremented by the number of popped items. The type array type[] is appropriately updated
by the type of pushed items. For example, bytecode instructions that push a local variable onto
the operand stack or pop the stack top into a local variable are translated into symbolic register
copies as follows ($ means a translation-time action).
iload n               // stack: ... => ..., (value of local variable n)
    mov iln, isTOP+1  // means a copy "isTOP+1 = iln"
    $TOP = TOP+1      // the stack now has one more item.
    $type[TOP] = 'I'  // the type of the new item is integer.
astore n              // stack: ..., (object reference) => ...
    mov asTOP, aln
    $TOP = TOP-1      // the stack now has one less item.
It should be noted that these symbolic register copy instructions do not really generate code
because they will be coalesced away during the scheduling/allocation phase.
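To make the translation-time bookkeeping concrete, here is a toy Python sketch (not VLaTTe's actual code) that applies the iload/istore/iadd rules above while tracking the translation-time TOP counter:

```python
# Toy per-instruction translator: each bytecode maps to pseudo RISC text
# based solely on its opcode and the current translation-time TOP.
def translate(bytecodes):
    """Translate (opcode, operand...) tuples into pseudo RISC instructions."""
    top, out = 0, []
    for op, *arg in bytecodes:
        if op == 'iload':        # push local variable n onto the stack
            top += 1
            out.append(f'mov il{arg[0]}, is{top}')
        elif op == 'istore':     # pop the stack top into local variable n
            out.append(f'mov is{top}, il{arg[0]}')
            top -= 1
        elif op == 'iadd':       # pop two items, push their sum
            out.append(f'add is{top-1}, is{top}, is{top-1}')
            top -= 1
    return out

# "b = b + c" compiles to: iload_1; iload_2; iadd; istore_1
print(translate([('iload', 1), ('iload', 2), ('iadd',), ('istore', 1)]))
```

Note that the two mov instructions produced here are exactly the copies that the later scheduling/allocation phase coalesces away.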
The arithmetic/logical/shift bytecode instructions that operate on the top items of the operand
stack can be directly mapped to one or two pseudo instructions.
iadd                                 // stack: ..., x, y => ..., (x+y)
    add isTOP-1, isTOP, isTOP-1      // means "isTOP-1 = isTOP-1 + isTOP"
    $TOP = TOP-1                     // the stack now has one less item.
For instance variable access, it is done by a single memory access since instance variables are
included in the object (along with a pointer to the method table and a 32-bit lock). Here is an
example pseudo code for reading an integer instance variable foo from an object x.
getfield <x.foo>                     // stack: ..., (object ref) => ..., (integer)
    ld [asTOP + foo_offset], isTOP   // "isTOP = load @[asTOP + foo_offset]"
    $type[TOP] = 'I'                 // foo_offset is a constant
The JVM is required to throw a NullPointerException if the object reference is NULL. LaTTe
does not generate such check code here because if the object reference is NULL, a SIGSEGV or
SIGBUS signal will be raised by the OS during the execution of the load/store; the LaTTe JVM
includes a signal handler where the NullPointerException is thrown.
Other bytecode instructions require more elaborate translation, considering exception han-
dling, object layout, and the calling conventions. More details can be found in [1].
2.3 A Translation Example
Figure 1 shows a simple translation example from bytecode into pseudo RISC code.
[Figure 1 shows four panels: (a) the Java source

     a = a + c;
     if (b > c) b = b - c; else b = b + c;
     tmp = a; a = b; b = tmp;

(b) its bytecode (offsets 17-43), (c) the pseudo RISC code produced by the
per-instruction translation rules, and (d) the CFG of the pseudo code,
partitioned into tree Regions A and B at the join point after the conditional
branch, where a, b, c are live and tmp is dead.]
Figure 1: A translation example from bytecode into pseudo code; last uses are marked by ’.
3 Fast Scheduling and Register Allocation
The translation rules described above indicate that it is straightforward to convert the bytecode
into RISC code with symbolic registers. We now describe a fast, global scheduler with local
register allocation which schedules operations aggressively while coalescing copies efficiently, thus
conserving registers. The technique is based on list scheduling integrated with the left-edge greedy
interval coloring algorithm [4], extended to a larger region of code called the tree region.
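For background, the left-edge greedy interval coloring that the technique builds on can be sketched as follows for straight-line live ranges (illustrative code under assumed inputs, not VLaTTe's implementation):

```python
# Left-edge greedy interval coloring: sort live ranges by start point and
# reuse the lowest-numbered register whose previous interval has ended.
import heapq

def left_edge_alloc(intervals):
    """intervals: name -> (start, end); returns name -> register index."""
    free, active, alloc, next_reg = [], [], {}, 0
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        while active and active[0][0] < start:   # expire finished intervals
            _, reg = heapq.heappop(active)
            heapq.heappush(free, reg)
        reg = heapq.heappop(free) if free else next_reg
        if reg == next_reg:                      # had to open a new register
            next_reg += 1
        alloc[name] = reg
        heapq.heappush(active, (end, reg))
    return alloc

# x and z do not overlap, so they share a register; y needs its own.
print(left_edge_alloc({'x': (0, 3), 'y': (1, 2), 'z': (4, 5)}))
```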
3.1 Tree Regions
The CFG of pseudo RISC code is partitioned into tree regions which are single-entry, multiple-exit
subgraphs shaped like trees (same as extended basic blocks). Tree regions start at the beginning
of the program or at control join points and end at the end of the program or at other join points.
For example, the following CFG, composed of three basic blocks (A,B,C), includes two tree regions:
Region (1) starts at label A: and ends at the flow graph edge corresponding to goto B, while
A: ...
   goto B                     // Region (1) = { A }
B: ...
   if cc goto B else goto C   // Region (2) = { B, C }
C: ...
   return
region (2) starts at label B: and ends at the flow graph edges corresponding to goto B and return.
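The partitioning itself can be sketched in a few lines of Python (an illustrative re-implementation of the rule above, not VLaTTe's code): region headers are the entry block and every join point, and each remaining block joins the region of its unique predecessor.

```python
# Partition a CFG into tree regions (extended basic blocks).
def tree_regions(cfg, entry):
    """cfg: block -> list of successor blocks; returns header -> member blocks."""
    preds = {n: 0 for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s] += 1
    # Headers: the entry block, plus any control join point.
    headers = {n for n in cfg if n == entry or preds[n] >= 2}
    regions = {h: [h] for h in headers}

    def header_of(n):
        while n not in headers:
            n = next(p for p, ss in cfg.items() if n in ss)  # unique predecessor
        return n

    for n in cfg:
        if n not in headers:
            regions[header_of(n)].append(n)
    return regions

# The three-block example above: A -> B; B -> B, C; C returns.
print(tree_regions({'A': ['B'], 'B': ['B', 'C'], 'C': []}, 'A'))
```

On the example this yields region (1) = {A} and region (2) = {B, C}, since B is a join point (reached from A and from its own back edge).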
A tree region is the unit of optimizations in V LaTTe. Tree regions can be enlarged by code
duplication techniques such as loop unrolling to increase the opportunity of optimizations in
frequently executed parts of code (we have not exploited this yet). By working on tree regions,
VLaTTe trades off between the quality and speed of scheduling and register allocation.
After regions are constructed for a method, the last uses of symbolic registers are computed.
Stack and temporary symbolic registers are supposed to be dead once they are used in a
translated instruction, since their live ranges cannot span beyond the translated code sequence
for a single bytecode; consequently, the last uses of these symbolic registers can be readily
identified. For local
symbolic registers, however, liveness is computed “approximately” via a single post-order traversal
of the regions such that every local symbolic register is assumed to be live on a backward edge of
the CFG. This is a conservative (upper bound) yet fast estimation of live variables at each region.
Based on this information, we can identify the last use of each local symbolic register (marked by
’ in Figure 1). When a local symbolic register becomes dead on one path of a conditional branch
while it is live on the other path, we mark the last use of it on the path where it becomes dead.
The regions are then scheduled with register allocation one by one in reverse post-order traversal
of the CFG of regions. Within each region, the operations of the tree itself are traversed
twice, first in a post-order traversal called the backward sweep, followed by a topologically-
sorted order of traversal called the forward sweep. The backward sweep collects information on
the preferred destination registers for instructions (which works as a local lookahead) while the
forward sweep performs actual scheduling and register allocation using that information, creating
VLIWs. During each traversal, a map which is a set of (symbolic, real) register pairs is collected
and propagated following the traversal direction. The map is called p map in the backward sweep
which describes preferred assignment for destination registers, and h map in the forward sweep
which describes the current register allocation result.
3.2 Backward Sweep
The backward sweep() algorithm in Figure 2 is called with the root of the region as an argument.
The purpose of the backward sweep is to compute the p map for destinations of instructions, based
on the required register assignment at the end of the region (if any) or at method calls/returns
according to the calling convention. For example, if a symbolic register il2 is to be allocated by
a real register r at the end of a region (e.g., due to the prior register allocation in other regions
that reach the same region boundary) and we have the operation “add is1, is2, is1” followed by
“mov is1, il2” just before the end of the region, then the destination is1 of the add is preferred
to be allocated to r to avoid a copy (saved at the preferred assignment field of instructions).
backward_sweep(instr) returns p_map   // p_map: set of (symbolic, real) register pairs
  if (instr is the header of a next region)            // reached a region boundary
    p = NULL;
    if (there is an old h_map, h_old, saved at instr)  // the region boundary has been visited
      p = h_old;                                       // previously via a prior forward sweep
    return p;                                          // in other regions

  p = backward_sweep(NextInstr(instr));
  if (instr has more than one successor)
    for (each successor s of instr other than NextInstr(instr))
      p = merge_map(p, backward_sweep(s));

  // Now, compute the p_map above instr
  if (instr has a target symbolic register x AND there is a preferred assignment (x,r) in p)
    preferred_assignment(instr) = r;
    if (instr is a copy x=y)
      insert (y,r) into p;
    delete (x,r) from p;        // (x,r) is not propagated above instr
  else if (instr is a call)
    insert (argument symbolic register, out register) pairs into p;
    delete (returned value symbolic register, *) from p;
  else if (instr is a return)
    insert (is0, %i0) into p;   // add (return value symbolic register, return register) to p

  return p;
Figure 2: The backward sweep algorithm.
If the preferred map for x is r under a copy “mov y, x”, then the preferred map for y becomes
r above the copy. At a conditional branch, the preferred map of both paths are unioned, yet if
there are two different preferred assignments for a symbolic register, the one coming from the
more likely path is taken (when there is no profiling done, an arbitrary one is taken).
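The copy rule can be illustrated in isolation (a toy sketch with hypothetical register names, not the paper's code): above a copy, the destination's preferred real register transfers to the source.

```python
# Above "mov src, dst", dst's preferred real register becomes src's
# preference, and (dst, r) is no longer propagated upward.
def propagate_copy(p_map, dst, src):
    """Return the p_map holding just above the copy dst = src."""
    p = dict(p_map)
    if dst in p:
        p[src] = p.pop(dst)
    return p

# il2 must end up in r5 at the region end; above "mov is1, il2" the
# preference moves to is1, so the instruction producing is1 targets r5.
print(propagate_copy({'il2': 'r5'}, dst='il2', src='is1'))  # {'is1': 'r5'}
```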
3.3 Forward Sweep
After the preferred assignments for instructions are computed, a forward sweep is performed to
schedule the region while allocating real registers. The earliest scheduling time of each operation
is determined using an array called avail[] while register allocation is performed using the h map.
The right-hand-side of each operation is readily allocated with h map, and searching for the target
register occurs concurrently with finding the VLIW where the operation will be scheduled.
The schedule() algorithm in Figure 3 is called at the root of the region with h, an h map that
is saved at the root. From h, we can determine refcount that shows how many symbolic registers
are mapped to each real register and freereg which indicates the set of real registers to which
no symbolic registers map. For the first region at the beginning of a method, h is initialized by
the map of parameters according to the calling convention. As regions are allocated in reverse
post-order, h at the end of a region is propagated to the beginning of the next region.
When scheduling a tree region, we actually create a tree of tree VLIWs [5] during the forward
sweep. That is, we visit each instruction in topologically sorted order (the exact order will be
determined by priorities of the paths in the region which will be described shortly) and schedule it
in the earliest possible VLIW with its source and target operands being allocated to real registers.
Let us first describe the register allocation part of the forward sweep. When an instruction
z=x+y is encountered, the registers are allocated as follows. First, the right-hand-side (RHS) is
allocated simply as h[x]+h[y] no matter where it is scheduled. If x is the last use, the refcount of
the real register h[x] is decremented by one, and h[x] is removed from the liveRegs if the refcount
becomes zero, and (x, h[x]) is deleted from h (meaning that x is now dead). We do the same for y.
For the target operand z, if the instruction is a copy z=x and x was mapped to a real register
r, then we just update h by allocating z to r (h[z]=r), meaning that the copy is coalesced and
does not generate any code. Otherwise, we have to allocate a real register for z, but we need to
decide first where to schedule the instruction before we choose the real register (unlike its source
schedule( root_instr, given_h )
  init_path = create_path( continuation=root_instr, prob=1.0, pathLength=1,
                           avail="all 0s", h=given_h, refcount,
                           liveRegs (computed from h) );
  add init_path to pathlist;

  while( pathlist != NULL )
    p = First item in pathlist;          // most probable path
    INSTR = p.continuation;

    if( IsRegionEnd(INSTR) )
      // Generate copy ops at the end of a region to match register mappings
      if( INSTR does not have h_old assigned yet ) h_old(INSTR) = p.h;
      else GenerateCopyOps( p.h, h_old(INSTR) );
      remove p from pathlist;
      continue;

    schedule_an_instr( INSTR, p );

    // update p & pathlist
    if( INSTR has multiple successors )
      remove p from pathlist;
      for( Each successor s of INSTR )
        pathlist = insert_path_in_prob_order( clone_path(p,s), pathlist );
        // clone_path(p,s) makes a copy of path p continuing at s. Also:
        // 1- probability of the new path is set to orig. probability * prob(p,s)
        // 2- h, refcount, liveRegs of the path are updated
        //    based on symbolic registers dead at s
    else
      p.continuation = next_instr( INSTR );   // just continue with the next op
Figure 3: The forward sweep algorithm.
operands). When the VLIW to place the instruction is determined, we check if there is a preferred
assignment for it (a real register that z will eventually be mapped into) and if it is dead at that
VLIW. If so, we choose the register. Otherwise, we choose any dead register at that VLIW (if
every register is live, we need to perform a spill). Now, the pair (z, the chosen register) is inserted
into h and the updated h will be used for register allocation of the next instruction.
Starting from the root of a region, all instructions are register-allocated as described above
while being scheduled. When the header (root) of the next region is encountered, we save the
current h map at the root so that the forward sweep at the next region can start with an initial
h map. Since the header is a join point, more than one forward sweep reaches the same header.
If no h map is saved at the header when the current forward sweep reaches it, we simply save the
current h map there. Otherwise, we need to reconcile the current and the old h map by inserting
some copies if needed. The reconciling problem, in fact, is similar to replacing SSA φ nodes by a
set of equivalent move operations [6] and we can use the same solution to minimize those copies1.
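A minimal sketch of that reconciliation, assuming a dedicated scratch register named tmp (the GenerateCopyOps details are not given in the paper): the required moves form a parallel copy, which is sequentialized like SSA φ elimination, breaking register cycles through the scratch register.

```python
# Reconcile the current h_map with the h_map saved at a join point by
# emitting "mov src, dst" copies (destination last, as in the paper).
def reconcile(h_cur, h_old):
    """Return copies that move each value from h_cur[x] to h_old[x]."""
    moves = {h_old[x]: h_cur[x] for x in h_old if h_cur[x] != h_old[x]}
    out = []
    while moves:
        # A destination is safe to write once no pending move reads it.
        ready = [d for d in moves if d not in moves.values()]
        if ready:
            d = ready[0]
            out.append(f'mov {moves.pop(d)}, {d}')
        else:
            # Only cycles remain: park one destination in the scratch register.
            d = next(iter(moves))
            out.append(f'mov {d}, tmp')
            moves = {dst: ('tmp' if src == d else src) for dst, src in moves.items()}
    return out

# A swap: x currently in r0 but recorded in r1, and vice versa for y.
print(reconcile({'x': 'r0', 'y': 'r1'}, {'x': 'r1', 'y': 'r0'}))
```

As the footnote notes, these copies can themselves be scheduled into VLIWs where their sources are available and their targets are free.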
schedule_an_instr( INSTR, p )
  Parse INSTR as "z=x+y";
  real_rhs = "p.h[x]+p.h[y]";
  OldLiveRegs = p.liveRegs;     // checkpoint liveRegs to detect change

  // check last uses
  if( x is a last use )
    if( --p.refcount[p.h[x]] == 0 ) p.liveRegs -= p.h[x];
    p.h[x] = NULL;
  if( There is a second operand y != x  &&  y is a last use )
    if( --p.refcount[p.h[y]] == 0 ) p.liveRegs -= p.h[y];
    p.h[y] = NULL;

  // Copy operations do not generate code, just change the h map
  if( INSTR is a copy z=x )
    p.h[z] = real_rhs;          // i.e. the old p.h[x]
    p.refcount[p.h[z]]++;
    p.liveRegs += p.h[z];
    return;

  earliest, vliw_list = get_sched_time( INSTR, p, OldLiveRegs );
  insert_ins_in_sched( p, INSTR, vliw_list, earliest, real_rhs );
Next we describe the scheduling part of the forward sweep. The algorithm schedule() maintains
a priority queue called pathlist for currently open paths in the tree region where there is scheduling
going on. Instructions are visited and scheduled at the end of the currently most probable path
(where probabilities are obtained through profiling, or just guessed). Initially, the algorithm starts
with a single path. As a conditional branch is scheduled, the current path is cloned for the two
paths and are inserted into the pathlist in order of probability (our experiments do not use it).
For a given operation z=x+y encountered during the forward sweep, we allocate its RHS as h[x]
+ h[y] as described previously. Then we find the first VLIW on the current path, starting from
the root VLIW of the region, where x is ready and the first VLIW where y is ready, and take
the later of the two (each path maintains an array avail that maps each physical register
1These copies can also be scheduled in VLIWs where source registers are available and target registers are free.
to the first VLIW on this path where the register is ready). Let us call this VLIW the earliest
VLIW. The operation will be scheduled in one of VLIWs between the earliest VLIW and the last
VLIW on a given path.
get_sched_time( INSTR, p, OldLiveRegs )
  // Make an array of VLIWs from the root to the end of the path
  vliw_list = make_vliw_array( p.curr_vliw );

  // Get the earliest VLIW from the data dependences of the source operands
  earliest = MAX( p.avail[ h[INSTR's each source] ] );

  // Check if allowed to schedule earlier than the last VLIW
  if( INSTR has multiple branch targets || INSTR is a store )
    earliest = MAX( earliest, p.pathLength - 1 );

  if( OldLiveRegs != p.liveRegs )
    // Live real regs changed ==> recompute live variables
    vliw_list[p.pathLength].liveRegs = p.liveRegs;
    for( time = p.pathLength-1; time > earliest; time-- )
      l = compute_live( vliw_list[time] );
      if( vliw_list[time].liveRegs == l ) break;    // stop when no more changes
      vliw_list[time].liveRegs = l;

  // Compute the real regs that are free until the end
  free_till_end = all 1s;
  for( time = p.pathLength; time > earliest; time-- )
    free_till_end &= ~vliw_list[time].liveRegs;
    vliw_list[time].free_until_end = free_till_end;

  return earliest, vliw_list;
Before we schedule the operation at some VLIW, we need to decide the real register for its
target operand, which should be dead at that VLIW. Each VLIW maintains a set of live registers
to help the decision. The problem is that some of these currently live registers might become
dead when we move up an operation whose RHS allocation results in newly dead registers (when
the RHS includes last uses); the newly dead register can also be used as a target register for the
operation being scheduled. In order to obtain this information locally even before we actually
move up the operation, the scheduling proceeds as follows. If any register has gone dead when
allocating the RHS of the current operation, the algorithm recalculates the live real registers of
the preceding VLIW on the current path (using new dead registers), then does the same for the
VLIW just before that, until a VLIW is encountered whose live registers do not change, or the
earliest VLIW is encountered. Then the algorithm walks forward on that path from the earliest
VLIW until it finds a VLIW where there is a functional unit available and a dead register r that
can be allocated to the target z and is not live until the end of the current path (i.e., a dead
register at that VLIW is not necessarily dead until the end of the path). The operation r = h[x]
+ h[y] is inserted at that VLIW.
Then we do the final cleanup: h[z] is updated to r, and r is made live between the
VLIW where the operation is scheduled and the last VLIW on the current path. Also, the real
registers of the RHS are made live between the earliest VLIW and the VLIW where the operation
is actually scheduled.
The place where z is ready (avail[h[z]]) is set to the VLIW that is away by the latency of the
operation from the VLIW where it is scheduled. Stores, traps, and branches are always scheduled
in the last VLIW on a given path.
3.4 Scheduling Constraints for Java Exceptions
VLaTTe catches Java runtime exceptions such as NullPointerException using traps (exceptions).
This imposes a constraint on the scheduling algorithm: the behavior of Java exceptions in the
program must not be affected. For potentially exceptional instructions such as loads, this means
they could neither be scheduled speculatively above a branch nor be reordered, which would
severely constrain the scheduling. In order to achieve high ILP while guaranteeing in-order
checking of exceptional conditions, we have provided out-of-order execution support via exception
tags added to each register, an out-of-order version of each opcode that may raise an exception, and a
commit operation that checks the exception tags. We have also enhanced the scheduling algorithm
for this hardware support (more precise than in [7, 8]). This topic is beyond the scope of this
paper and will be discussed elsewhere. Currently, VLaTTe guarantees identical Java exception
behavior to that of the original bytecode program.
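The intended semantics of the tag-and-commit scheme can be modeled as follows (an illustrative sketch of the behavior described above; the actual hardware interface is deferred to the companion work): a speculatively executed load records a pending exception in its destination register's tag rather than trapping, and the in-order commit raises it only when control actually reaches the instruction's original position.

```python
# Model of out-of-order exception checking with per-register tags.
class TaggedReg:
    def __init__(self, value=0, tag=None):
        self.value, self.tag = value, tag

def load_ooo(addr_reg, heap):
    """Speculative load: defer a null-pointer fault into the result's tag."""
    if addr_reg.value is None:
        return TaggedReg(0, tag='NullPointerException')
    return TaggedReg(heap[addr_reg.value])

def commit(reg):
    """In-order commit: surface the deferred exception, if any."""
    if reg.tag is not None:
        raise RuntimeError(reg.tag)
    return reg.value

r = load_ooo(TaggedReg(None), {})      # speculation past a null check
try:
    commit(r)                          # exception surfaces only here
except RuntimeError as e:
    print('exception raised at commit:', e)
```

If the branch guarding the load takes the other path, the tagged result is simply never committed and the spurious exception vanishes, which is what makes the speculative scheduling safe.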
3.5 A Scheduling and Register Allocation Example
Figure 4 describes the scheduling and register allocation process in detail for the example in
Figure 1. The final VLIW schedule is shown in Figure 5. The example shows that the original
insert_ins_in_sched( pathlist, p, INSTR, vliw_list, earliest, real_rhs )
  // Go from earliest to latest to find the first VLIW where the operation fits
  real_lhs = NULL;
  for( time = earliest; time < p.pathLength; time++ )
    vliw = vliw_list[time];
    nextvliw = vliw_list[time+1];

    if( vliw has no resource for INSTR ) continue;
    if( INSTR has a dest reg z )
      if( INSTR has a preferred assignment r for its destination z
          AND r is in nextvliw.free_until_end )
        real_lhs = r;
        break;
      else if( nextvliw.free_until_end != NULL )
        real_lhs = Smallest free real register in nextvliw.free_until_end;
        break;
      else continue;    // resources OK, but dest reg not available
    else break;         // resources OK, and dest reg not needed

  // Append a new VLIW if the operation did not fit anywhere
  if( time == p.pathLength )
    append_list( vliw_list, create_new_vliw( p.pathLength++ ) );

  schedule "real_lhs = real_rhs" in vliw_list[time];
  for( each register s in real_rhs )
    for( Each VLIW v, avail[s] < v <= time )
      Add s to v.liveRegs;
  for( Each VLIW v, time+1 <= v <= lastvliw+1 )
    Add the dest reg real_lhs (if present) to v.liveRegs;

  p.h[z] = real_lhs;
  p.refcount[p.h[z]]++;
  p.liveRegs += p.h[z];
  p.avail[p.h[z]] = time + op_latency;
bytecode can be executed by only two VLIWs and most of the copies in the initial pseudo code
are coalesced away. The remaining copies are necessary since there is swapping of two variables.
4 Preliminary Experimental Results
We have implemented VLaTTe on the SPARC platform. The current version used for the experiments
has not been fully optimized and tuned. In particular, it does not employ any object-oriented
(OO) optimization techniques such as customization or specialization combined with inlining,
which will be useful for increasing the ILP opportunity (currently, the number of instructions in
a tree region is too small due to the OO feature of small method size). Also, the JIT translation
Pseudo Code             Scheduling Actions
======================= =============================
Start:
    h[v]=r means symbolic register v is mapped to hardware register r.
    v' indicates the last use of v.
    Assume the initial mapping h[il0]=r0, h[il1]=r1, h[il2]=r2.

17: mov il0',is5        h[is5]=h[il0]=r0
18: mov il2,is6         h[is6]=h[il2]=r2
19: add is5',is6',is5   h[is5]=r0, h[is6]=r2 are ready in VLIW 1.
                        is5 is last use of r0, free h[is5]=r0.
                        Assign r0 (first free reg in VLIW 1) to dest is5,
                        new h[is5]=r0.
                        ==> Schedule add r0,r2,r0 in VLIW 1

20: mov is5',il0        h[il0]=h[is5]=r0
21: mov il1,is5         h[is5]=h[il1]=r1
22: mov il2,is6         h[is6]=h[il2]=r2
23: sle is5',is6,ctp17  h[is5]=r1, h[is6]=r2 are ready in VLIW 1.
                        Assign cc0 to ctp17, new h[ctp17]=cc0.
                        ==> Schedule sle r1,r2,cc0 in VLIW 1
23: jset ctp17', 33     h[ctp17]=cc0 is ready in VLIW 2.
                        ==> Schedule jset cc0, 33 in VLIW 2
                        Make two paths
                        (let us assume the false path has high priority)

26: mov il1',is5        h[is5]=h[il1]=r1
27: mov il2,is6         h[is6]=h[il2]=r2
28: sub is5',is6',is5   h[is5]=r1, h[is6]=r2 are ready in VLIW 1.
                        is5 is last use of r1, free h[is5]=r1.
                        Assign r3 (first free reg in VLIW 1; r1 is free
                        only in the false path of VLIW 2) to dest is5,
                        new h[is5]=r3.
                        ==> Schedule sub r1,r2,r3 in VLIW 1
29: mov is5',il1        h[il1]=h[is5]=r3
30: (goto 43)           Remove the false path and save h at 43.

33: mov il1',is5        h[is5]=h[il1]=r1
34: mov il2,is6         h[is6]=h[il2]=r2
35: add is5',is6',is5   h[is5]=r1, h[is6]=r2 are ready in VLIW 1.
                        is5 is last use of r1, free h[is5]=r1.
                        Assign r1 (now free in VLIW 1 also) to dest is5,
                        new h[is5]=r1.
                        ==> Schedule add r1,r2,r1 in VLIW 1
36: mov is5',il1        h[il1]=h[is5]=r1
37: mov il0',is5        h[is5]=h[il0]=r0
38: mov is5',il3        h[il3]=h[is5]=r0
39: mov il1',is5        h[is5]=h[il1]=r1
40: mov is5',il0        h[il0]=h[is5]=r1
41: mov il3,is5         h[is5]=h[il3]=r0
42: mov is5',il1        h[il1]=h[is5]=r0
                        We should reconcile the h map (see Figure 5).
43: ...
Figure 4: The scheduling/allocation process for the example.
[Figure 5 comprises three panels: (a) Conflicts at join points, (b) Reconcile register
allocation, (c) Final Schedule. In each panel the schedule is:
    VLIW 1:  19: add r0,r2,r0   23: sle r1,r2,cc0   28: sub r1,r2,r3   35: add r1,r2,r1
    VLIW 2:  23: jset cc0, 33   (T/F branch)
At the join point 43, the code expects to use il0(r0), il1(r3), il2(r2). The false path
arrives with h[il0]=r0, h[il1]=r3, h[il2]=r2, but the true path arrives with h[il0]=r1,
h[il1]=r0, h[il2]=r2: a conflict. Reconciliation inserts mov r0,r3 and mov r1,r0 on the
true path, after which both paths agree on h[il0]=r0, h[il1]=r3, h[il2]=r2
(with avail[r0]=1, avail[r1]=1).]
Figure 5: The final VLIW schedule.
speed has not been fully optimized. Nonetheless, we report some preliminary experimental results
with the SPECJVM98 benchmark suite to evaluate the quality and speed of VLaTTe.
4.1 Experimental Environment
Our Java benchmarks comprise 8 programs from the SPECJVM98 suite, described in Table 1.
Our test machine is a SUN Ultra5 (270 MHz) with 256 MB of memory running Solaris 2.6,
tested in single-user mode.
Benchmark       Description          Bytes
200 check       JVM checker          27606
201 compress    Compress Utility     24326
202 jess        Expert Shell System  45392
209 db          Database             26425
213 javac       Java Compiler        92000
222 mpegaudio   MP3 Decompressor     38930
227 mtrt        Raytracer            34009
228 jack        Parser Generator     51380
Table 1: Java Benchmark Description and Translated Amount of Bytes.
The target VLIW machines are based on parametric resource constraints of n ALUs and n-way
branching, and we have evaluated three machines, with n = 2, 4, and 8. Each ALU can perform
either an arithmetic operation or a memory operation in one cycle, while branches are handled by
a dedicated branch unit. All these machines are assumed to have 96 integer registers (32 SPARC
+ 64 scratch for renaming), 48 floating point registers, 8 condition registers, and perfect caches.
Since our implementation runs on the SPARC machine, a set of SPARC simulation instructions
(in direct binary form) is generated for each VLIW. These SPARC instructions emulate the
actions of each VLIW and update statistics. In fact, we use a compiled simulation method similar
to Shade [9] for simulating our VLIW machine on the SPARC.
VLaTTe's JIT translation is closely coordinated with the LaTTe JVM's on-demand exception
handling which involves a collaboration between the translated code, the JIT compiler, and the
JVM for streamlining the translation process across exception handlers and finally blocks.
LaTTe JVM is also equipped with a lightweight monitor [10] and a fast mark-and-sweep garbage
collector.
4.2 Useful IPC Results
Figure 6 depicts the useful IPC of each benchmark on the three VLIW machines. The useful IPC
is calculated as the number of SPARC instructions in the sequential execution trace (where
each VLIW contains a single SPARC instruction) divided by the number of executed VLIW instructions.
Figure 6: Useful IPCs.
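As a concrete illustration of this formula (the instruction and VLIW counts below are made up, not taken from the experiments):

```python
def useful_ipc(seq_sparc_instrs, executed_vliws):
    """Useful IPC = sequential-trace instruction count / executed VLIWs."""
    return seq_sparc_instrs / executed_vliws

# 10,500 SPARC instructions packed into 5,000 executed VLIWs:
print(useful_ipc(10_500, 5_000))  # 2.1
```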
Considering the non-numerical and object-oriented nature of many of our benchmarks, the
geometric means of useful IPC, 1.7 (2-ALU) and 2.1 (4-ALU), appear competitive, while that
of the 8-ALU machine is somewhat lower. This is mainly due to small tree regions, especially in
213 javac, 227 mtrt, and 228 jack, where many OO features limit the ILP opportunity. We
expect better IPC from the OO optimizations that are under development.
4.3 Scheduling and Register Allocation Overhead
In order to evaluate the scheduling/register allocation overhead of VLaTTe, we measured the
running time of phase 3 and compared it with the running time of phase 3 of LaTTe, which
performs fast register allocation only (the results are shown in Table 2). LaTTe's JIT
compilation time mostly comes from register allocation overhead and has been shown to be tiny
compared to the whole running time of the programs (consistently 1-2 seconds for SPECJVM98
runs of 40-80 seconds) [1]. In this way, we can evaluate the translation overhead of VLaTTe indirectly.
Benchmark        Register Allocation   R.A. + Scheduling   Ratio
                 Time (LaTTe)          Time (VLaTTe)       (VLaTTe/LaTTe)
200 check        0.87                  3.26                3.75
201 compress     0.78                  2.70                3.46
202 jess         1.37                  4.50                3.28
209 db           0.82                  2.89                3.52
213 javac        2.66                  9.29                3.49
222 mpegaudio    1.29                  5.28                4.09
227 mtrt         1.02                  3.98                3.90
228 jack         1.94                  6.15                3.17
geomean                                                    3.57
Table 2: Scheduling/Register Allocation Time (seconds).
On average, the unified scheduling/allocation time of VLaTTe is about 3.6 times longer than
the running time of LaTTe's fast register allocator. Since VLaTTe will eventually run on a
faster VLIW machine, its translation time appears reasonable. The translation time can be
reduced with further tuning and with adaptive compilation, which is also under development.
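The 3.6x figure is the geometric mean of the per-benchmark ratios reported in Table 2, and can be reproduced directly:

```python
import math

# Per-benchmark ratios from Table 2 (VLaTTe time / LaTTe time).
ratios = [3.75, 3.46, 3.28, 3.52, 3.49, 4.09, 3.90, 3.17]
geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(round(geomean, 2))  # 3.57, matching the geomean row of Table 2
```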
Since the current implementation is a research prototype, the absolute running time might
be less meaningful. We examined how the scheduling/allocation time is affected by the size of a
method by measuring the scheduling/allocation time of each method separately. Figure 7 depicts
the plot of (number of instructions in a method, scheduling/allocation time), on a logarithmic
scale, for all methods in the SPECJVM98 benchmarks. The figure shows that the
scheduling/allocation overhead is linear in the number of instructions in the method, with a
slope of 1 on the logarithmic scale, indicating practically linear complexity of our algorithm.

(Footnote: LaTTe's overall performance on SPECJVM98 is 2.5 times higher than the SUN JDK 1.1.6
JIT compiler and similar to JDK 1.2 beta4 (HotSpot), without exploiting adaptive compilation [1].)
Figure 7: Linear translation time of VLaTTe.
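The slope-1 reading of a log-log plot can be checked numerically: a least-squares fit of log(time) against log(size) for linear-time data yields a slope close to 1. The data below are synthetic (the real measurements come from Figure 7):

```python
import math
import random

# Synthetic roughly-linear translation times: time ~ 0.01 * size, with noise.
random.seed(0)
sizes = [2 ** k for k in range(4, 12)]
times = [0.01 * n * random.uniform(0.9, 1.1) for n in sizes]

# Least-squares slope on the log-log data.
xs = [math.log(n) for n in sizes]
ys = [math.log(t) for t in times]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 2))  # close to 1.0 for linear-time data
```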
5 Summary
In this paper, we have described the design and implementation of VLaTTe, a Java JIT compiler
with fast and efficient scheduling and register allocation for VLIW machines. Our linear-scan
algorithm, which integrates scheduling and register allocation with a local lookahead, is an
engineering solution that trades off the quality of scheduling/allocation against its speed,
tuned just enough for Java JIT compilation. This is confirmed by our preliminary experimental
results on translation overhead and IPC, which indicate that VLaTTe is promising and would lead
to a more competitive JIT compiler when combined with other optimizations, including OO
optimizations and adaptive compilation, which are left as future work.
References
[1] B. Yang et al. LaTTe: A Java VM just-in-time compiler with fast and efficient register allocation. Submitted for publication, 1999.
[2] S. Sinnhal and B. Nguyen. The Java Factor. Communications of the ACM, 41(6):34-47, June 1998.
[3] F. Yellin and T. Lindholm. The Java Virtual Machine Specification. Addison-Wesley, 1996.
[4] A. Tucker. Coloring a Family of Circular Arcs. SIAM Journal of Applied Mathematics, 29(3):493-502, Nov. 1975.
[5] K. Ebcioglu. Some Design Ideas for a VLIW Architecture for Sequential Natured Software. In M. Cosnard et al., editors, Parallel Processing (Proceedings of IFIP WG 10.3 Working Conference on Parallel Processing, Pisa, Italy), pages 3-21. North Holland, April 1988.
[6] R. Morgan. Building an Optimizing Compiler. Digital Press, 1998.
[7] K. Ebcioglu and E. Altman. DAISY: Dynamic compilation for 100% architectural compatibility. In Proceedings of ISCA '97, 1997.
[8] S. Mahlke et al. Sentinel scheduling for VLIW and superscalar processors. In Proceedings of ASPLOS-5, Oct. 1992.
[9] R. F. Cmelik and D. Keppel. Shade: A Fast Instruction Set Simulator for Execution Profiling. Technical Report TR-93-12, Sun Microsystems, July 1993.
[10] B.-S. Yang, J. Lee, J. Park, S.-M. Moon, and K. Ebcioglu. Lightweight Monitor in Java Virtual Machine. In the 3rd Workshop on Interaction between Compilers and Computer Architectures, Oct. 1998.
Speeding-up Multiprocessors Running DSS Workloads through Coherence Protocols

Pierfrancesco Foglia, Roberto Giorgi, Cosimo Antonio Prete
Dipartimento di Ingegneria dell'Informazione, Facolta' di Ingegneria, Universita' di Pisa, Via Diotisalvi, 2 - 56126 PISA (Italy)
[email protected], prete,[email protected]
Abstract
In this work, we analyze how a DSS (Decision Support System) workload can be accelerated on a shared-bus
shared-memory multiprocessor by adding simple support to the classical MESI coherence protocol. The
DSS workload has been set up using the TPC-D benchmark on the PostgreSQL DBMS. The analysis has been performed via
trace-driven simulation, and operating system effects are also considered in our evaluation.
We analyzed a basic four-processor and a high-end sixteen-processor machine, implementing MESI and two coherence
protocols that deal with the migration of processes and data: PSCR and AMSD. Results show that, even in the four-processor
case, for a DSS workload the use of a write-update protocol with a selective invalidation strategy for private data improves
performance (and scalability) with respect to a classical MESI-based solution, because of the access pattern to shared data
and the lower bus utilization due to the absence of invalidation misses once the contribution of passive sharing is eliminated.
In the 16-processor case, and especially in situations where the scheduler cannot satisfy affinity requirements, the gain
becomes more important: the advantage of a write-update protocol with a selective invalidation strategy for private data, in
terms of execution time, can be quantified at about 20% relative to the other evaluated protocols. This advantage is about
50% in the case of high cache-to-cache transfer latency.
Keywords: Shared-Bus Multiprocessors, DSS system, Cache Memory, Coherence Protocol, Sharing Analysis, Sharing
Overhead.
1 Introduction
An ever-increasing number of multiprocessor server systems shipped today run commercial workloads [50]. These
workloads include database applications such as online transaction processing (OLTP) and decision support systems (DSS),
file servers, and application servers [34]. Nevertheless, technical workloads were widely used to drive the design of current
multiprocessor systems [34], [10], [50], and several studies have shown that commercial workloads exhibit different
behavior from technical ones [20], [25].
The simplest design for a multiprocessor system is a shared-bus shared-memory architecture [51]. In shared-bus systems,
processors access the shared memory through a shared bus. The bus is the bottleneck of the system, since it can easily reach
a saturation condition, thus limiting the performance and the scalability of the machine. The classical solution to overcome
this problem is the use of per-processor cache memories [22]. Cache memories introduce the coherency problem and the
need for adopting adequate coherence protocols [37], [38]. The main coherence protocol classes are Write-Update (WU) and
Write-Invalidate (WI) [37]. WU protocols update the remote copies on each write involving a shared copy, whereas WI
protocols invalidate remote copies in order to avoid updating them. Coherence protocols generate several bus transactions,
thus accounting for a non-negligible overhead in the system (coherence overhead). Coherence overhead may have a negative
effect on the performance and, together with the accesses pattern to application data, determines the best protocol choice for
a given workload [55], [56], [19]. Different optimizations to minimize coherence overhead have been proposed [12], [37],
[42], [18], also acting at compile time and architectural level (as the adoption of adequate coherence protocols [39], [18]).
When the performance achieved by a shared-bus shared-memory multiprocessor is not sufficient, as is typical for
DBMS applications, an SMP (Symmetric Multi-Processing, which includes shared-bus shared-memory architectures) or a
NUMA (Non-Uniform Memory Access) approach can be adopted [51], [22]. In the first case, a crossbar switch interconnects
the processing elements; in the second, an interconnection network does. Such solutions increase the communication bandwidth
among elements, thus allowing more CPUs to be added to the system, but at the cost of a more expensive and complex
communication network. In both designs, the basic building block (node) may be a single-processor system or, better, a
shared-bus shared-memory multiprocessor (examples are the HP V-Class and the SGI Origin families of multiprocessors
[52]). In this way, by adding high-performance nodes, we can achieve the desired level of performance with only a small
number of elements, simplifying the crossbar or interconnection network design and lowering the price of the whole system.
Unfortunately, due to limited bus bandwidth, only a small number of CPUs (at most 4 for the Pentium family [32]) may
be included in a single node.
The aim of this paper is to analyze the scalability of shared-bus shared-memory multiprocessors running commercial
workloads, and DSS applications in particular, and to investigate memory-subsystem solutions that can
increase the processing power of such architectures. In this way, we can meet the performance requirements of commercial
applications with a single shared-bus shared-memory machine, or we can adopt higher-performance nodes in an SMP or
NUMA design, allowing simpler and cheaper designs of switch or interconnection networks.
Decision Support Systems are based on DBMSs; they are used to extract management information from and analyze huge
amounts of historical data. A typical DSS activity is performed via a query that looks for ways to increase revenues, such as
an SQL query quantifying the revenue increase that would have resulted from eliminating a given company discount in a
given year [53]. DSS queries are long running, typically scanning large amounts of data [50]. In particular,
such queries are, for the most part, read-only. Consequently, as already observed in [45], [52], coherence activity and
transactions essentially involve DBMS metadata (i.e., data structures that are needed by the DBMS to work rather than to
store the data), and particularly the data structures needed to implement software locks [45], [52]. The type of access pattern to
those data is one-producer/many-consumers. Such a pattern may favor solutions based on the Write-Update family of
coherence protocols, especially when a high number of processes (the 'consumers') concurrently read shared data
[19]. These considerations suggest considering coherence protocols other than the usually adopted MESI
WI protocol [36].
In our evaluation, the DSS workload of the system is generated by running all the TPC-D queries [44] (through the
PostgreSQL [49] DBMS) and several Unix utilities, which both access the file system and interface the DBMS with system
services running concurrently. Our methodology relies on trace-driven simulation, by means of the “Trace Factory”
environment [17], and on specific tools for the analysis of coherence overhead [14].
The performance of the memory subsystem, and therefore of the whole system, depends on cache parameters and
coherence management techniques, and is influenced by operating system activities like process migration, cache-affinity
scheduling, kernel/user code interference, and virtual memory mapping. The importance of considering operating system
activity has been highlighted in previous works [6], [5], [41]. Process migration, needed to achieve load balancing in such
systems, increases the number of cold/conflict misses and also generates useless coherence overhead, known as passive
sharing overhead [18]. As passive sharing overhead dramatically decreases the performance of pure WU protocols [18],
[47], we considered in our analysis a hybrid WU protocol, PSCR [18], which adopts a selective invalidation strategy for
private data, and a hybrid WI protocol, AMSD [8], [33], which deals with the migration of data and, hence, of processes.
Our results show that, in the case of a 4-processor configuration, the performance of the DSS workload is moderately
influenced by cache parameters, and the influence of the coherence protocol is minimal. In 16-processor configurations, the
performance differences due to the adoption of different architectural solutions are significant. In high-end architectures,
MESI is not the best choice, and workload changes can nullify the action of affinity-based scheduling algorithms.
Architectures based on a write-update protocol with a selective invalidation strategy for private data outperform those based
on MESI by about 20%, and adapt better to workload changes. We will also show solutions that are less sensitive to
cache-to-cache transfer latency.
The rest of the paper is organized as follows. In Section 2, we discuss architectural parameters and issues related to the
coherence overhead. In Section 3, we report the results of studies related to the analysis of workloads similar to ours,
differentiating our contribution. In Section 4, we present our experimental setup and methodology. In Section 5, we discuss
the results of our experiments. In Section 6, we draw the conclusions.
2 Issues Affecting Coherence Overhead
Maintaining cache coherency involves a number of bus operations. Some of them are overhead that adds to the basic
bus traffic (the traffic necessary to access main memory). Three different sources of sharing may be observed: i) true sharing
[40], [42], which occurs when the same cached data item is referenced by different processes concurrently running on
different processors; ii) false sharing [40], [42], which occurs when several processors reference a different data item
belonging to the same memory block separately; iii) passive [30] or process-migration [1] sharing, which occurs when a
memory block, though belonging to a private area of a process, is replicated in more than one cache, as a consequence of the
migration of the owner process.
The main coherence protocol classes are Write-Update (WU) and Write-Invalidate (WI) [37]. WU protocols update the
remote copies on each write involving a shared copy, whereas WI protocols invalidate remote copies in order to avoid
updating them. An optimal selection of the coherence protocol can be made by considering the traffic induced by the two
approaches under different kinds of sharing and their effects on performance [55]. The coherence overhead induced by a WU
protocol is due to all the operations needed to update the remote copies. A WI protocol, instead, invalidates remote copies,
and processors generate a miss on the next access to an invalidated copy (invalidation miss). The access patterns to shared data
determine the coherence overhead. In particular, fine-grain sharing denotes high contention for shared data, while sequential
sharing is characterized by long sequences of writes to the same memory item performed by the same processor. A WI
protocol is adequate in the case of sequential sharing, whilst, in general, a WU protocol performs better than WI for programs
characterized by fine-grain sharing [56], [19], [16] (alternatively, a metric based on the write-run and external rereads model
can be used to estimate the best protocol choice [2]). In our evaluation, we considered the MESI protocol as the baseline, because it
is used in most of the high performance processors today (like AMD K5 and K6, PowerPC series, SUN UltraSparc II, SGI
R10000, Intel Pentium, Pentium Pro series (Pro, II, III), Pentium 4, and IA-64/Itanium). Then, based on the previous
considerations on the access patterns of DSS workloads, we included a selective invalidation protocol based on a WU policy
for managing shared data, which limits most effects of process migration (PSCR [18]), and a WI protocol specifically designed
to treat data migration (AMSD [8], [33]).
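The WU-versus-WI trade-off described above can be illustrated with a toy single-block bus-traffic model. This is an illustrative sketch with made-up access traces, not the simulator used in the paper:

```python
def coherence_transactions(trace, protocol):
    """Count bus transactions for one memory block under 'WU' or 'WI'.

    Deliberate simplifications: WU write misses are not charged, and a WI
    write costs one transaction whether it fetches or only invalidates.
    """
    holders = set()   # CPUs currently holding a valid copy of the block
    bus = 0
    for cpu, op in trace:
        if op == 'r':
            if cpu not in holders:
                bus += 1                 # read miss (incl. invalidation miss)
                holders.add(cpu)
        elif protocol == 'WU':
            if holders - {cpu}:
                bus += 1                 # broadcast update to remote copies
            holders.add(cpu)
        else:                            # 'WI'
            if cpu not in holders or holders - {cpu}:
                bus += 1                 # invalidate (and fetch if needed)
            holders = {cpu}
    return bus

# One producer, two consumers repeatedly reading (fine-grain sharing):
fine = [(0, 'w'), (1, 'r'), (2, 'r'), (0, 'w'), (1, 'r'), (2, 'r')]
# One stale remote copy, then a long private write run (sequential sharing):
seq = [(1, 'r')] + [(0, 'w')] * 6

print(coherence_transactions(fine, 'WU'), coherence_transactions(fine, 'WI'))  # 3 6
print(coherence_transactions(seq, 'WU'), coherence_transactions(seq, 'WI'))    # 7 2
```

Even in this toy model, WU wins under fine-grain sharing (updates are cheaper than invalidation misses), while WI wins under sequential sharing (one invalidation silences the whole write run).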
Our implementation of MESI uses the classical MESI protocol states [36] and the following bus transactions: read-block
(to fetch a block), read-and-invalidate-block (to fetch a block and invalidate any copies in other caches), invalidate (to
invalidate any copy in other caches), and update-block (to write back dirty copies, when they need to be replaced). This is
mostly similar to the implementation of MESI in the Pentium Pro (Pentium II) processor family [32]. The invalidation
transaction used to obtain coherency has, as a drawback, the need to reload a copy if a remote processor uses it
again, thus generating a miss (invalidation miss). Therefore, MESI coherence overhead (that is, the transactions needed
to enforce coherence) is due both to invalidate transactions and to invalidation misses.
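The transitions just described can be sketched as a minimal per-cache MESI model. This is an illustrative sketch using the transaction names listed above, not the simulator's implementation:

```python
def local_access(state, op, others_have_copy):
    """Next state and bus transaction for a local read ('r') or write ('w')."""
    if op == 'r':
        if state == 'I':
            return ('S' if others_have_copy else 'E'), 'read-block'
        return state, None                       # read hit: no transaction
    if state == 'I':
        return 'M', 'read-and-invalidate-block'  # write miss
    if state == 'S':
        return 'M', 'invalidate'                 # upgrade, kill other copies
    return 'M', None                             # hit in M or E: silent

def snoop(state, remote_op):
    """Next state when a remote read/write for the block is seen on the bus."""
    if remote_op == 'r':
        # M supplies data via update-block (write-back) and drops to S.
        return 'S' if state in ('M', 'E') else state
    return 'I'                                   # remote write invalidates

print(local_access('I', 'w', False))  # ('M', 'read-and-invalidate-block')
```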
PSCR (Passive Shared Copy Removal) adopts a selective invalidation scheme for private data, and uses a Write-Update
scheme for shared data, although it is not a purely Write-Update protocol. A cached copy belonging to a process private area
is invalidated locally as soon as another processor fetches the same block [18]. This technique reduces coherence overhead
(especially passive sharing overhead), otherwise dominant in a pure WU protocol [30]. Invalidate transactions are eliminated
and coherence overhead is due to Write Transactions and Invalidation Misses caused by local invalidation (Passive Sharing
Misses).
AMSD [8], [33] is designed for Migratory Sharing, which happens when the control over shared data migrates from one
process to another running on a different processor. The protocol identifies migratory-shared data dynamically to reduce the
cost of moving them. The implementation relies on an extension of a common MESI protocol. Though designed for
migratory sharing, AMSD may have some beneficial effects also on passive sharing. AMSD coherence overhead is due to
Invalidate Transactions and Invalidation Misses.
The process scheduling strategy, cache parameters, and the bus features also influence the coherence overhead and,
therefore, the multiprocessor performance. The process scheduler guarantees load balancing among the processors by
scheduling a ready process on the first available processor. Although cache affinity is used, a process may migrate to a
different processor rather than on the last used processor, producing at least two effects: i) some misses when it restarts on a
new processor (context-switch misses); ii) useless coherence transactions due to passive sharing. These situations may
become frequent due to the dynamic nature of workloads driven by user requests, like DSS ones. Process migration also
influences the access patterns to data and, therefore, the coherence overhead.
All the factors described above have important consequences on the global performance, which we shall evaluate in
Section 5.
3 Related Work on DSS Systems
In this section, we summarize the main results of evaluations of DSS and similar workloads on several
multiprocessor architectures and operating systems. The search for a realistic evaluation framework for shared-memory
multiprocessor evaluations [34], [50] motivated many studies that consider benchmarks like the TPC series (including DSS,
OLTP, and WEB-server benchmarks) representative of commercial workloads [45], [3], [4], [31], [28], [27], [25].
Trancoso et al. study the memory access patterns of a TPC-D-based DSS workload [45]. The DBMS is Postgres95
running on a simulated four-processor CC-NUMA multiprocessor. For cache block sizes ranging from 4 to 128 bytes and
cache capacities from 128K-bytes to 8M-bytes, the main results are that both large cache blocks and data prefetching help,
due to the spatial locality of index and sequential queries. Coherence misses can be more than 60% of total misses in queries
that use index-scan algorithms for select operations.
Barroso et al. evaluate an Alpha 21164-based SMP memory system through hardware counter measurements and SimOS
simulations [3]. Their system was running Digital UNIX and the Oracle 7 DBMS. The authors consider a TPC-B database
(OLTP benchmark), TPC-D (DSS queries), and the AltaVista search engine. They found that memory accounts for 75%
of the stall time. In the case of the TPC-D and AltaVista workloads, the size and latency of the on-chip caches mostly influence
the performance. The number of processes per processor, which is usually kept high in order to hide I/O latencies, also
significantly influences cache behavior. When using 2-, 4-, 6-, and 8-processor configurations, they found that coherency
miss stalls increase linearly with the number of processors. Beyond an 8M-byte outer-level cache, they observed that true
sharing misses limit performance.
Cao et al. examine a TPC-D workload executing on a Pentium Pro 4-processor system, with Windows NT and MS SQL
Server [4]. Their goal is to characterize this DSS system on a real machine, in particular, regarding processor parameters, bus
utilization, and sharing. Their methodology is based on hardware counters. They found that kernel time is negligible (less
than 6%). Major sources of processor stalls are instruction fetch and data miss in outer level caches. They found lower miss
rates for data caches in comparison with other studies on TPC-C [3], [25]. This is due to the smaller working set of TPC-D
compared with TPC-C.
Ranganathan et al. consider both an OLTP workload (modeled after TPC-B [43]) and a DSS workload (query 6 of TPC-
D [44]) [31]. Their study is based on trace-driven simulation, where traces are collected on a 4-processor AlphaServer4100
running Digital Unix and Oracle 7 DBMS. The simulated system is a CC-NUMA shared-memory multiprocessor with
6
advanced ILP support. Results show that, on a 4-processor system, an ILP configuration with 4-way issue, a 64-entry
instruction window, and 4 outstanding misses already provides significant benefits for the OLTP and DSS workloads. Such
configurations are even less aggressive than commercial ILP processors like the Alpha 21264, HP PA-8000, and MIPS
R10000 [48]. The latter processor, used in our evaluation, makes us reasonably confident that this processor architecture is a
sound basis for investigating the memory subsystem.
Another performance analysis of OLTP (TPC-B) and DSS (query 6 of TPC-D) workloads via simulation is presented in
[28]. The simulated system is a commercial CC-NUMA multiprocessor, constituted by four-processor SMP nodes connected
using Scalable Coherent Interface based coherent interconnects. The coherence protocol is directory-based. Each node is based
on a Pentium Pro processor with a 512K-byte, 4-way set-associative cache. Results show that the TPC-D miss rate is much
lower and performance is less sensitive to L2 miss latency than in the OLTP (TPC-B) experiments. They also analyze the
scalability exhibited by such workloads. The speed-up of the DSS workload is close to the theoretical one. Results show that
scalability strongly depends on the miss rate.
Lo et al. [27] analyze the performance of database workloads running on simultaneous multithreading processors [13] -
an architectural technique to hide memory and functional-unit latencies. The study is based on trace-driven simulation of a
four-processor AlphaServer 4100 and the Oracle 7 DBMS, as in [3]. They consider both an OLTP workload (modeled after the TPC-B
[43] benchmark) and a DSS workload (query 6 of the TPC-D [44] benchmark). Results show that while DBMS workloads
have large memory footprints, there is substantial data reuse in a small working set.
Summarizing, these studies considered DSS workloads but they were mostly limited to four-processor systems, did not
consider the effects of process migration, and did not correlate the amount of sharing with the performance of the system. As
systems include more processors, it becomes crucial to characterize the memory subsystem further. In our work, we investigated
both 4- and 16- processor configurations, finding that larger caches may have several drawbacks due to coherence overhead.
This is mostly related to the use of shared structures like indices and locks in TPC-D workload. In Section 5, we will classify
the sources of this overhead and propose solutions to overcome limitations to the performance related to process migration.
4 Methodology and Workload
The methodology that we used is based on trace-driven simulation [35], [29], [46] and on the simulation of the three
kernel activities that most affect performance: system calls, process scheduling, and virtual-to-physical address translation.
The approach used is to produce a source trace - a sequence of memory references, system-call positions (and
synchronization events if needed) - by means of a tracing tool. Trace Factory then models the execution of complex
multiprogrammed workloads by combining multiple source traces and simulating system calls (which could also involve I/O
activity), process scheduling and virtual-to-physical address translation. Finally, Trace Factory produces the references
(target trace) that is furnished as input to a memory-hierarchy simulator [29]. Trace Factory generates references according
to an on-demand policy: it produces a new reference when the simulator requests one, so that the timing behavior imposed by
the memory subsystem conditions the reference production [17]. Process management is modeled by simulating a scheduler that
dynamically assigns a ready process. Virtual-to-physical address translation is modeled by mapping sequential virtual pages
into non-sequential physical pages. A careful evaluation of this methodology has been carried out in [17].
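The on-demand interleaving of per-process source traces into one target trace can be sketched as a generator driven by a round-robin "scheduler". This is a simplified illustration of the policy described above, not Trace Factory's actual implementation:

```python
def target_trace(source_traces, quantum=2):
    """Interleave per-process source traces, yielding (pid, reference)
    pairs one at a time, as the memory simulator requests them."""
    iters = {pid: iter(trace) for pid, trace in source_traces.items()}
    ready = list(iters)
    while ready:
        for pid in list(ready):
            for _ in range(quantum):       # run pid for one quantum
                try:
                    yield pid, next(iters[pid])
                except StopIteration:
                    ready.remove(pid)      # process finished its trace
                    break

# Two hypothetical source traces, interleaved two references at a time:
srcs = {0: ['a1', 'a2', 'a3'], 1: ['b1', 'b2']}
print(list(target_trace(srcs)))
# [(0, 'a1'), (0, 'a2'), (1, 'b1'), (1, 'b2'), (0, 'a3')]
```

Because the output is a generator, the consumer (here, a hypothetical memory-hierarchy simulator) pulls references one at a time, so its timing can feed back into which reference is produced next, mirroring the on-demand policy.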
Table 1. Statistics of source traces for some Unix commands and daemons and for multiprocess
source traces (PostgreSQL, TPC-D queries), in the case of 64-byte block size and 10,000,000
references per process.

APPLICATION  NO. OF     DISTINCT  CODE (%)  DATA (%)        SHARED   SHARED DATA (%)
             PROCESSES  BLOCKS              READ    WRITE   BLOCKS   ACCESS   WRITE
awk (beg)    1          4963      76.76     14.76   8.48    n/a      n/a      n/a
awk (mid)    1          3832      76.59     14.48   8.93    n/a      n/a      n/a
cp           1          2615      77.53     13.87   8.60    n/a      n/a      n/a
rm           1          1314      86.39     11.51   2.10    n/a      n/a      n/a
ls -aR       1          2911      80.62     13.84   5.54    n/a      n/a      n/a
telnetd      1          463       82.75     12.96   4.29    n/a      n/a      n/a
crond        1          2464      75.86     16.35   7.79    n/a      n/a      n/a
syslogd      1          2848      80.41     14.96   4.63    n/a      n/a      n/a
TPC-D18      18         139324    73.05     16.89   10.06   7224     1.58     0.43
TPC-D12      12         93657     73.04     16.91   10.05   5662     1.55     0.41
Table 2. Statistics of multiprogrammed target traces (DSS26, 260,000,000 references; DSS18, 180,000,000 references) in case of 64-byte block size.

WORKLOAD  NO. OF     DISTINCT  CODE   DATA     DATA      SHARED  SHARED     SHARED
          PROCESSES  BLOCKS    (%)    READ(%)  WRITE(%)  BLOCKS  ACCESS(%)  WRITE(%)
DSS26     26         179862    74.59  16.26    9.15      7806    1.76       0.53
DSS18     18         124268    74.00  16.11    8.89      6242    1.63       0.48
The workload considered in our evaluation includes DB activity reproduced by means of an SQL server, namely PostgreSQL [49], which handles the TPC-D [44] queries. We also included Unix utilities that access the file system, interface with the various programs running on the system, and reproduce the activity of typical Unix daemons.
PostgreSQL is a public-domain DBMS that relies on a client-server paradigm. It consists of a front-end process that accepts SQL queries, and a back-end that forks the processes which manage the queries. TPC-D is a benchmark for DSS
developed by the Transaction Processing Performance Council [44]. It simulates an application for a wholesale supplier that
manages, sells, and distributes a product worldwide. Following TPC-D specifications, we populated the database via the
dbgen program, with a scale factor of 0.1. The data are organized in several tables and accessed by 17 read-only queries and
2 update queries.
In a typical situation, application and management processes may require the support of different system commands and ordinary applications. To this end, Unix utilities (ls, awk, cp, and rm) and daemons (telnetd, syslogd, crond) have been added to the workload. These utilities are important because they model the 'glue' activity of the system software: i) they do not have shared data, and thus they increase the effects of process migration, as discussed in detail in Section 5; ii) they may interfere with the shared-data and code cache footprints of other applications. To take into account that requests may be using the same program at different times, we traced some commands in shifted execution sections: initial (beg) and middle (mid).
In our experiments, we generated two distinct workloads. The first workload (DSS26 in Table 2) includes the TPC-D18 source trace (Table 1). TPC-D18 is a multiprocess source trace that accounts for the activity of 18 processes. One process is generated by the DBMS back-end execution, whilst the other processes are generated by the concurrent execution of the 17 read-only TPC-D queries (TPC-D18 in Table 2). Since we wanted to explore situations that are critical for affinity scheduling (when the number of processes changes), we also generated a second workload (DSS18 in Table 2) that includes a subset of the TPC-D queries (TPC-D12). Table 1 contains some statistics of the uniprocess and multiprocess source traces used to generate the DB workloads (target traces, Table 2). Both the source traces related to DBMS activity (TPC-D18, TPC-D12) and the resulting workloads (DSS26 and DSS18) present similar characteristics in terms of read, write, and shared accesses.
5 Results

In this section, we show the memory-subsystem performance for our DSS workload, with a detailed characterization of coherence overhead and process-migration problems. To this end, we include results for several values of the most influential cache-architecture parameters. Finally, we consider situations that are critical for affinity scheduling, and we analyze the sensitivity to cache-to-cache latency. Our results show that, for this kind of machine, solutions are possible that achieve a higher scalability than standard solutions, and consequently a greater performance at a reasonable cost.
5.1 Design Space of our System

The simulated system consists of N processors, interconnected through a 128-bit shared bus for accessing shared memory. The following coherence schemes have been considered: AMSD, MESI, and PSCR (more details are in Section 2). We considered two main configurations: a basic machine with 4 processors and a high-performance one with 16 processors. The scheduling policy is based on cache affinity; the scheduler time-slice is 200,000 references. Cache size has been varied between 512K bytes and 2M bytes, while for block size we used 64 bytes and 128 bytes. The simulated processors are MIPS-R10000-like; paging relies on a 4-KByte page size; the bus is pipelined, and supports transaction splitting and a processor-consistency memory model [15]; up to 8 outstanding misses are allowed [26]. The base-case-study timings and parameter values for the simulator are summarized in Table 3.
Table 3. Input parameters for the multiprocessor simulator (timings are in clock cycles)

Class   Parameter                                      Timings
CPU     Read cycle                                     2
        Write cycle                                    2
Cache   Cache size (Bytes)                             512K, 1M, 2M
        Block size (Bytes)                             64, 128
        Associativity (Number of Ways)                 1, 2, 4
Bus     Write transaction (PSCR)                       5
        Write-for-invalidate transaction (AMSD, MESI)  5
        Invalidate transaction (AMSD)                  5
        Memory-to-cache read-block transaction         72 (64-byte block), 80 (128-byte block)
        Cache-to-cache read-block transaction          16 (64-byte block), 24 (128-byte block)
        Update-block transaction                       10 (64-byte block), 18 (128-byte block)
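For use in a simulator, the Table 3 timings can be captured in a small lookup table. The sketch below is our own illustration (the names `TIMINGS` and `bus_cost` are ours, not part of the original simulator); block-size-dependent transactions carry one value per block size:

```python
# Table 3 timings (clock cycles); block-size-dependent transactions
# are stored as {block size in bytes: cycles}.
TIMINGS = {
    "cpu_read": 2,
    "cpu_write": 2,
    "write_transaction": 5,             # PSCR
    "write_for_invalidate": 5,          # AMSD, MESI
    "invalidate_transaction": 5,        # AMSD
    "mem_read_block": {64: 72, 128: 80},
    "c2c_read_block": {64: 16, 128: 24},
    "update_block": {64: 10, 128: 18},
}

def bus_cost(kind, block_size=64):
    """Cycles taken on the bus by one transaction of the given kind."""
    timing = TIMINGS[kind]
    return timing[block_size] if isinstance(timing, dict) else timing
```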
5.2 Performance Metrics

To investigate the effects of memory-subsystem operations on performance, we analyzed the causes influencing memory latency and the total execution time. The memory latency depends on the time necessary to perform a bus operation (Table 3) and on the waiting time to access the bus (bus latency). Bus latency depends on the amount and kind of bus traffic. Bus traffic consists of read-block transactions (issued for each miss), update transactions, and coherence transactions (write transactions or invalidate transactions, depending on the coherence protocol). Consequently, miss and coherence transactions affect performance both because they affect the processor waiting-time directly (in the case of read misses) and because they contribute to bus latency. Therefore, to investigate the sources of performance bottlenecks, we report a breakdown of misses - which includes invalidation misses and classical misses (the sum of cold, capacity, and conflict misses) - and the "number of coherence transactions per 100 memory references" - which includes either write transactions or invalidate transactions, depending on the coherence protocol. The rest of the traffic is due to update transactions. Update transactions are a negligible part of the bus traffic (lower than 7% of read-block transactions, or about 1% of bus occupancy) and thus they do not greatly influence our analysis.
Figure 1. Normalized Execution Time versus cache size (512K, 1M, 2M bytes), number of ways (1, 2, 4), and coherence protocol (AMSD, MESI, PSCR). Data assume 4 processors, 64-byte block size, and a cache-affinity scheduler. Execution Times are normalized with respect to the MESI, 512K, direct-access architecture. PSCR presents the lowest Execution Time, whilst MESI presents the highest.
In our analysis, we differentiated among Cold Misses, Capacity+Conflict Misses, and Invalidation Misses [22]. Cold Misses include first-access misses by a given process when it is scheduled either for the first time or when it is rescheduled on another processor (the latter are also known as context-switch misses). Capacity+Conflict Misses include misses caused by memory references of a process competing for the same block (intrinsic interference misses in [1]), and misses caused by references of sequential processes executing on the same processor and competing for the same cache block (extrinsic interference misses in [1]). Invalidation Misses are due to accesses to data that are going to be reused on the same processor, but that have been invalidated in order to maintain coherence. Invalidation Misses are further classified, along with the coherence-transaction type, by means of an extension of an existing classification algorithm [23]. Our algorithm extends this classification to the case of passive sharing, finite-size caches, and process migration [14]. In particular, Invalidation Misses are differentiated as true sharing misses, false sharing misses, and passive sharing misses. True sharing misses and false sharing misses are classified according to an already known methodology [42], [12], [11]. Passive sharing misses are invalidation misses generated by the useless coherence maintenance of private data [18]. Clearly, these private data can appear as shared to the coherence protocol because of process migration. Similarly, coherence transactions are classified as true, false, and passive sharing transactions, whether they are invalidation or write transactions [14].
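The core of the true/false/passive distinction can be sketched as a small decision tree. This is a deliberately simplified illustration of ours, not the paper's actual classification algorithm (which also handles finite caches and the coherence-transaction type); all names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class InvalidatedLine:
    """State remembered for a cache block that was invalidated earlier."""
    invalidating_cpu_word: int       # word index written by the invalidating store
    shared_by_migration_only: bool   # private data that only migration made 'shared'

def classify_invalidation_miss(line: InvalidatedLine, missed_word: int) -> str:
    """Simplified decision tree in the spirit of the classification used
    in the paper."""
    if line.shared_by_migration_only:
        return "passive"   # useless coherence on private data after migration
    if missed_word == line.invalidating_cpu_word:
        return "true"      # the written word itself is re-read: true sharing
    return "false"         # a different word of the same block: false sharing
```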
Figure 2. Breakdown of miss rate versus cache size (512K, 1M, 2M bytes), number of ways (1, 2, 4), and coherence protocol (AMSD, MESI, PSCR). Data assume 4 processors, affinity scheduling, and 64-byte blocks.
Figure 3. Breakdown of miss rate versus cache size (512K, 1M, 2M bytes), number of ways (1, 2, 4) and coherence protocol
(AMSD, MESI, PSCR). Data assume 4 processors, affinity scheduling and 64-byte block.
5.3 Analysis of the Reference System

We started our analysis from a 4-processor machine, similar to cases used in the literature [45], [3], [4], [31], running the DSS26 workload (Table 2). In detail, the reference 4-processor machine has a 128-bit bus and a 64-byte block size. We varied cache size (from 512K to 2M bytes) and cache associativity (1, 2, 4).

Execution Time (Figure 1) is affected by the cache architecture: it decreases with larger cache sizes and/or higher associativity. However, this variation is limited to 8% between the least performing (512K-byte, one-way) and the most performing (2M-byte, four-way) configurations. In this case, the role of the coherence protocol is less important, with a difference among the various protocols, for a given setup, of less than 1%.
We analyzed the reasons for this performance improvement (Figure 2) by decomposing the miss rate in terms of traditional (cold, conflict, and capacity) misses and invalidation misses. In Figure 3, we show the contribution of each kind of sharing to the Invalidation Miss Rate and, in Figure 4, to the Coherence-Transaction "Rate" (i.e., the number of coherence transactions per 100 memory references). We also differentiated between kernel and user overhead.

As expected, cache size mainly influences capacity misses. Invalidation misses increase slightly when increasing cache size or associativity (Figure 2). For cache sizes larger than 1M bytes, cold misses are the dominant part, and invalidation misses weigh more and more (at least 30% of total misses). This indicates that, for our DSS workload, caches larger than 2M bytes already capture the main working set. In fact, for cache sizes larger than 2M bytes, Cold Misses remain constant, invalidation misses increase, and only Capacity+Conflict misses may decrease, but their contribution to the miss rate is already minimal. The solution based on PSCR presents the lowest miss rate and negligible invalidation misses.
Figure 4. Number of coherence transactions versus cache size (512K, 1M, 2M bytes), number of ways (1, 2, 4), and coherence protocol (AMSD, MESI, PSCR). Coherence transactions are invalidate transactions in MESI and AMSD, and write transactions in PSCR. Data assume 4 processors, 64-byte block size, and an affinity scheduler.
Our analysis of coherence overhead (Figures 3 and 4) confirms that the major sources of overhead are invalidation misses for WI protocols (MESI and AMSD) and write transactions for PSCR. Passive sharing overhead, either as invalidation misses or as coherence transactions, is minimal: this means that affinity scheduling performs well. In the user part, true sharing is dominant. In the kernel part, false sharing is the major portion of the coherence overhead.

The cost of misses dominates the performance, and indeed Figure 1 shows that PSCR achieves the best performance compared with the other protocols. The reason is the following: what PSCR loses in terms of extra coherence traffic, it regains in saved misses. Indeed, misses are more costly in terms of bus timings, and read-block transactions may produce a higher waiting time for the processor. Also, AMSD performs better than MESI, due to the reduction of the Invalidation Miss Rate overhead (Figure 2).

Our conclusions for the 4-processor configuration agree with previous studies as regards the analysis of miss rate and the effects of coherence maintenance [3], [4], [45].
Figure 5. Normalized Execution Time versus cache size (512K, 1M, 2M bytes), number of ways (1, 2, 4), and coherence
protocol (AMSD, MESI, PSCR). Data assume 16 processors, 64-byte block size, and an affinity scheduler. Execution Times are normalized with respect to the Execution Time of the MESI, 512K, direct access configuration.
Figure 6. Breakdown of miss rate versus coherence protocol (AMSD, MESI, PSCR) and number of processors (4, 16). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte two-way set-associative caches. The higher number of processors causes more coherence misses (false plus true sharing) and more capacity and conflict misses. It is interesting to observe that, while the aggregate cache size increases (from 4M to 16M bytes), Capacity and Conflict misses also increase. This situation is different from the uniprocessor case, where Capacity+Conflict misses always decrease when cache size increases. This effect is due to process migration.
5.4 Analysis of the High-End System

In the previous baseline-case analysis, we have seen that the four-processor machine is efficient: architecture variations do not produce further gains (e.g., the performance differences among protocols are small). This kind of machine may not satisfy the performance needs of DSS workloads, and higher-performance systems are demanded. Given current processor-memory speeds, we considered a 'high-end' 16-processor configuration. A detailed analysis of the limitations of this system shows that this architecture makes sense, i.e., a high-end SMP system proves efficient, if we solve the problems produced by process migration and adapt the coherence protocol to the access pattern of the workload. This architecture has not been analyzed in the literature, and it still represents a relatively economical solution to enhance performance.
Figure 7. Breakdown of Invalidation Miss Rate versus coherence protocol (AMSD, MESI, PSCR) and number of processors (4, 16). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte 2-way set-associative caches. Having more processors causes more coherence misses (false and true sharing). Passive sharing misses are slightly higher in PSCR compared to the other two protocols. This is a consequence of the selective invalidation mechanism of PSCR: as soon as a private block is fetched on another processor, PSCR invalidates all the remote copies of that block. In the other two protocols, the invalidation is performed on a write operation on shared data, thus less frequently than in PSCR.
Figure 8. Number of coherence transactions versus coherence protocol (AMSD, MESI, PSCR) and number of processors (4, 16). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte 2-way set-associative caches. There is an increment in the sharing overhead in all of its components. This increment is more evident in the WI-class protocols, also because there is more passive sharing overhead.
In the following, we will compare our results directly with the four-processor case and show the sensitivity to classical cache parameters (5.4.1). We will consider separately the case of a different block size (5.4.2), the case of a different workload pressure in terms of the number of processes (5.4.3), and the case of a different cache-to-cache latency (5.4.4).
5.4.1 Comparison with the 4-processor case and sensitivity to cache size and associativity

In Figure 5, we show the Execution Time for several configurations. Protocols designed to reduce the effects of process migration achieve better performance. In particular, the performance gain of PSCR over MESI is at least 13% in all configurations. The influence of the cache parameters is stronger, with a 30% difference between the most and the least performing configurations.

We clarify the relationship between Execution Time and process migration by showing the Miss Rate (Figure 6) and the Number of Coherence Transactions (Figure 8). In detail, we show the breakdown of the most varying component of the Miss Rate (Invalidation Misses, Figure 7) and the breakdown of coherence transactions (Figure 8). In the following, for the sake of clarity, we assume a 1-Mbyte cache size, a 64-byte block size, and 2-way associativity.
The Execution Time is mainly determined by the cost of read-block transactions and the cost of coherence-maintaining transactions (see also Section 5.2). Let us take MESI as the baseline. AMSD has a lower Execution Time because of its lower Miss Rate (Figure 6, and in particular its Invalidation Miss Rate, Figure 7). The small variation in the Number of Coherence Transactions (Figure 8) does not weigh much on the performance. PSCR gains strongly from the reduction of Miss Rate, while it is not penalized too much by the high Number of Coherence Transactions (Figures 6 and 8). In detail:

• Classical misses (Cold and Conflict+Capacity) do not vary with the coherence protocol, although they increase when switching from 4 to 16 processors because of the higher migration of processes: the Cold Miss portion increases because a process may be scheduled on a higher number of processors, and the Conflict+Capacity portion because a process has fewer opportunities of reusing its working set and it destroys the cache footprint generated by other processes.

• The much lower Miss Rate in PSCR is due to its low Invalidation Miss Rate (Figure 6). In particular, PSCR does not have invalidation misses due to true and false sharing, neither in user nor in kernel mode (Figure 7). On the contrary, in the other two protocols the latter factors weigh very much. This effect is more evident in the 16-processor case as compared to the 4-processor case, because of the higher probability of sharing data produced by the increased number of processors.
• The DSS workload, in the PostgreSQL implementation, exhibits an access pattern to shared data that favors the WU strategy for maintaining coherence on shared data with respect to the WI strategy. We can explain this pattern with the following considerations: the TPC-D DSS queries are read-only and, consequently, as already observed in [45], [52], coherence misses and transactions involve PostgreSQL metadata (i.e., data structures that are needed by the DBMS to work rather than to store the data), and particularly the data structures needed to implement software locks [45], [52]. The access pattern to those data is one-producer/many-consumers. Such a pattern advantages WU protocols, especially when a high number of processes (the 'consumers') concurrently reads shared data. When the number of processors switches from 4 to 16, the number of consumer processes increases, due to the higher number of processes concurrently in execution, and, consequently, true sharing Invalidation Misses and Transactions (in the user part) increase, especially for WI protocols (Figures 7 and 8), thus favoring the WU solution (PSCR). However, we cannot utilize a pure WU protocol, as passive sharing dramatically decreases performance [47].
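The one-producer/many-consumers argument can be quantified with a back-of-the-envelope sketch using the Table 3 timings. This is our own illustration (function names are ours, and the per-write accounting is deliberately simplified):

```python
READ_BLOCK = 72   # memory-to-cache read-block transaction, 64-byte block (Table 3)
INVALIDATE = 5    # invalidate transaction (Table 3)
UPDATE = 10       # update-block transaction (Table 3)

def wi_bus_cycles_per_write(n_consumers):
    """Write-invalidate: the producer's store invalidates all remote
    copies, then each consumer re-reads the block (an invalidation miss
    served by a read-block transaction)."""
    return INVALIDATE + n_consumers * READ_BLOCK

def wu_bus_cycles_per_write(n_consumers):
    """Write-update: a single update transaction refreshes every remote
    copy, so the consumers keep hitting in their caches."""
    return UPDATE
```

Under this simplified model, the WI cost grows linearly with the number of consumers while the WU cost stays flat, which matches the observation that PSCR's advantage widens when going from 4 to 16 processors.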
Figure 9. Normalized Execution Time versus block size (64, 128 bytes) and coherence protocol (AMSD, MESI, PSCR). Data assume an affinity scheduler and 1M-byte 2-way set-associative caches. Execution Times are normalized with respect to the execution time of the MESI, 64-byte-block configuration.
Figure 10. Breakdown of miss rate versus coherence protocol (AMSD, MESI, PSCR) and block size (64, 128 bytes). Data assume an affinity scheduler and 1M-byte 2-way set-associative caches. There is a decrease of the Cold and Capacity+Conflict miss components, and a small decrease of the invalidation-miss component.
5.4.2 Analyzing the effects of a larger block size in the 'High-End' System

We started our analysis from the 64-byte block size for comparison with other studies. When switching from 64 to 128 bytes, PSCR gains further advantages with respect to the other two considered protocols (Figure 9), for two reasons. First, we observe a reduction of the Capacity+Conflict miss component (Figure 10), a small reduction of coherence traffic (Figure 12), and a reduction of the Invalidation Miss Rate (Figure 11). Secondly, in the case of a 64-byte block, the system is in saturation(1) for all configurations of Figure 9. In the case of 128-byte blocks, an architecture based on PSCR is not saturated, and thus we can efficiently use configurations with a higher number of processors. When switching from 64 to 128 bytes, the decrease of the Execution Time is 27% for PSCR, and only 20% for the other protocols. We observe that a block size larger than 128 bytes produces diminishing returns, because the increased cost of the read-block transaction is not compensated by the reduction of the number of misses. A similar result has been obtained for a CC-NUMA machine running a DSS workload [28]. In that work, the number of "Effective Processors" for a 16-processor CC-NUMA system was almost the same as what we obtained for our cheaper shared-bus shared-memory system (figure not shown).

(1) We recall that a shared-bus shared-memory SMP system is in saturation [18] when the performance does not increase by at least a given quantity when we add one processor to the machine.
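The diminishing-returns trade-off can be expressed as a simple comparison of total read-block bus cycles before and after doubling the block size. This sketch is ours; the miss counts used in the test are hypothetical, chosen only to illustrate the two regimes:

```python
def larger_block_pays_off(misses_small, misses_large, cost_small, cost_large):
    """True when the miss reduction obtained with the larger block size
    outweighs its higher per-transaction read-block cost, i.e. when total
    read traffic (misses x cycles per read block) shrinks."""
    return misses_large * cost_large < misses_small * cost_small
```

With the Table 3 costs (72 cycles at 64 bytes, 80 at 128 bytes), a substantial miss reduction makes the larger block a win; when the miss reduction stalls, the higher read-block cost dominates and the larger block no longer pays off.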
Figure 11. Breakdown of Invalidation Miss Rate versus coherence protocol (AMSD, MESI, PSCR) and block size (64, 128 bytes). Data assume an affinity scheduler and 1M-byte 2-way set-associative caches. Passive Sharing Misses decrease when increasing the block size, because the invalidation unit is larger.
Figure 12. Number of coherence transactions versus coherence protocol (AMSD, MESI, PSCR) and block size (64, 128 bytes). Data assume an affinity scheduler and 1M-byte 2-way set-associative caches.
5.4.3 Analyzing the effects of variations in the number of processes of the workload

We considered another scenario, in which the number of processes in the workload may vary and thus the scheduler could fail in applying affinity. Affinity scheduling can fail when the number of ready processes is limited. We defined a new workload (DSS18, Table 2) with characteristics similar to those of the DSS26 workload used in the previous experiments, but consisting of only 18 processes. The machine under study is still the 16-processor one. In such a condition, the scheduler can only choose between at most two ready processes. The measured miss rate and number of coherence transactions (Figures 14, 15, 16) show an interesting behavior. The miss rate, and in particular the Cold and Conflict+Capacity miss rate, increases with respect to the DSS26 workload. This is a consequence of process migration, and it is determined by the failure of the affinity requirement: as the number of processes is almost equal to the number of processors, it is not always possible for the system to reschedule a process on the processor where it last executed. In such cases, PSCR can greatly reduce the associated overhead, and it achieves the best performance (Figure 13).
The main conclusion here is that PSCR maintains its advantage also under different load conditions, while the other protocols are more penalized by critical scheduling conditions.
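Why affinity fails when the ready queue is short can be seen in a minimal sketch of an affinity choice (our own illustration; the function and data structures are hypothetical, not the simulated scheduler's actual code):

```python
def pick_process(ready, cpu, last_cpu):
    """Cache-affinity choice: prefer a ready process whose last run was on
    this CPU; otherwise fall back to the head of the ready queue, which
    forces a migration. With a short ready queue (the DSS18 case: at most
    two ready processes on 16 processors) the fallback fires often, so
    affinity fails and migration-induced misses grow."""
    for pid in ready:
        if last_cpu.get(pid) == cpu:
            return pid
    return ready[0]
```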
Figure 13. Normalized Execution Time versus coherence protocol (AMSD, MESI, PSCR) and workload (DSS26 – DSS18). Data assume an affinity scheduler, 64-byte block, 1M-byte 2-way set associative caches.
Figure 14. Breakdown of Miss Rate versus coherence protocol (AMSD, MESI, PSCR) and workload (DSS26, DSS18). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte 2-way set-associative caches. The DSS18 workload exhibits the higher miss rate, due to an increased number of Cold, Capacity, and Conflict Misses. This is a consequence of process migration, which affinity fails to mitigate.
Figure 15. Breakdown of Invalidation Miss Rate versus coherence protocol (AMSD, MESI, PSCR) and workload (DSS26, DSS18). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte 2-way set-associative caches. Passive sharing misses decrease, while true and false sharing misses increase. This is a consequence of the failure of affinity scheduling: in the DSS18 execution, processes migrate more than in the DSS26 execution, thus generating more reuse of shared data (and more invalidation misses on shared data) but a lower reuse of private data (and a lower number of passive sharing misses).
Figure 16. Breakdown of Coherence Transactions versus coherence protocol (AMSD, MESI, PSCR) and workload (DSS26, DSS18). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte 2-way set-associative caches.
Figure 17. Normalized Execution Time versus coherence protocol (AMSD, MESI, PSCR) and the ratio between cache-to-cache transfer latency and read-block latency. Data assume an affinity scheduler, 64-byte block, 1M-byte 2-way set
associative caches. Execution Times are normalized with respect to the execution time of the MESI, 1/16 configuration.
5.4.4 Analyzing the effects of variations in the cache-to-cache transfer latency

In our implementation of the coherence protocols, we assume that, when data missing in a local cache are present in a remote cache, that cache supplies the data, because the transfer between caches is usually faster than the transfer between cache and memory (see Table 3). This is compatible with the early designs of bus-based systems, and it derives from the assumption that a cache can supply data more quickly than memory. This is especially true in the case of small-scale SMP architectures [9]. However, this advantage is not necessarily present in modern high-end SMP architectures [9], such as the Sun series based on the Wildfire [21] or the Fireplane bus [7], in which the cache-to-cache latency is similar to the latency of accessing main memory. As previous studies of commercial workloads have highlighted that a large portion of misses can be served by cache-to-cache transfers [25], [31], we analyze the effect of different cache-to-cache latencies on the Execution Time. Figure 17 shows the Execution Time for the three protocols when the cache-to-cache transfer latency is varied. The cases analyzed in the previous sections assumed a ratio of 1/16 between cache-to-cache latency and read-block latency.
Obviously, when we increase the latency, the Execution Time increases. However, the architecture based on PSCR shows less sensitivity to this parameter. This is a consequence of the lower number of cache-to-cache transfers needed in PSCR with respect to the other protocols. In fact, as a benefit of the WU strategy, shared data are never invalidated in PSCR. This also generates several updated copies in several caches: a migrating process does not need to reload them from a remote cache or from memory. When the cache-to-cache latency is increased up to the read-block latency, the advantage of PSCR is about 50% with respect to the other two coherence protocols.
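The sensitivity argument can be made concrete with a back-of-the-envelope model (a sketch with invented transfer counts, not the measured data of Figure 17): the more misses a protocol services by cache-to-cache transfer, the more its total miss-service time grows as that latency approaches the read-block latency.

```python
def miss_service_time(n_c2c, n_memory, c2c_latency, read_block_latency):
    """Total cycles spent servicing misses, split between
    cache-to-cache transfers and memory (read-block) transfers."""
    return n_c2c * c2c_latency + n_memory * read_block_latency

READ_BLOCK = 64  # hypothetical read-block latency in cycles

def sensitivity(n_c2c, n_memory):
    """Relative growth of miss-service time when the cache-to-cache /
    read-block latency ratio goes from 1/16 up to 1."""
    cheap = miss_service_time(n_c2c, n_memory, READ_BLOCK / 16, READ_BLOCK)
    costly = miss_service_time(n_c2c, n_memory, READ_BLOCK, READ_BLOCK)
    return costly / cheap

# A protocol resolving most misses cache-to-cache (hypothetical counts,
# MESI-like) degrades far more than one resolving few (PSCR-like):
assert sensitivity(n_c2c=800, n_memory=200) > sensitivity(n_c2c=200, n_memory=800)
```

The exact numbers are arbitrary; the point is only that the slope of the Execution Time curve in Figure 17 is governed by the fraction of misses served cache-to-cache.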
6 Conclusions

We evaluated the memory performance of a shared-bus shared-memory multiprocessor running a DSS workload, by considering several different choices that could improve the overall performance of the system. We considered architectures based on the following coherence protocols: MESI, a pure WI protocol widely used in high-performance multiprocessors; AMSD, a WI protocol designed to reduce the effects of data migration; and PSCR, a coherence protocol using a hybrid strategy (WU for shared data and WI for private data) designed to reduce the effect of process migration. The DSS workload was set up using the PostgreSQL DBMS executing queries of the TPC-D benchmark, together with typical Unix shell commands and daemons. We considered the kernel effects that are most relevant to our analysis, such as process scheduling, virtual memory mapping, and user/kernel code interactions.
Our conclusions for the 4-processor case agree with previous studies as regards the analysis of miss rate and the effects of coherence maintenance. Our analysis also shows that: i) cache sizes larger than 2M bytes already capture the working set of such workloads; ii) kernel effects account for 50% of the coherence overhead. Previous studies that considered DSS workloads were mostly limited to 4-processor systems, did not consider the effects of process migration, and did not correlate the amount of sharing to the performance of the system.
Our analysis of a "high-end" machine considered a 16-processor SMP. We analyzed variations of classical cache parameters, variations in the workload pressure on the scheduler due to a different number of processes, and variations of the cache-to-cache latency. We found that, in high-end systems, some factors that were less noticeable in the 4-processor case become more evident.
The MESI protocol is not the best choice in high-end SMP architectures: AMSD improves the performance of a DSS system by about 10% compared to MESI, and PSCR by about 20% compared to MESI, when we use a low cache-to-cache latency. DSS workloads running on SMP architectures generate a variable load, and the affinity scheduler may fail to satisfy the affinity requirements. The use of PSCR allows us to build systems whose performance is less influenced by the load conditions. The use of adequate protocols can give a DSS system an advantage in the case of SMP systems with high cache-to-cache latencies (up to 50% improvement in our case).
References

[1] A. Agarwal, Analysis of Cache Performance for Operating Systems and Multiprogramming. Kluwer Academic Publishers, Norwell, MA, 1989.
[2] S. J. Eggers,“Simplicity Versus Accuracy in a Model of Cache Coherency Overhead”. IEEE Transactions on Computers, Vol. 40, No. 8, pp. 893-906, August 1991.
[3] L.A. Barroso, K. Gharachorloo, and E. Bugnion, “Memory System Characterization of Commercial Workloads”. Proc. of 25th Int.l Symp. on Computer Architecture, Barcelona, Spain, pp. 3-14, June 1998.
[4] Q. Cao, J. Torrellas, P. Trancoso, J. L. Larriba-Pey, B. Knighten, and Y. Won, “Detailed characterization of a quad Pentium Pro server running TPC-D”. Proc. of the Int.l Conf. on Computer Design, pp.108-115, Oct. 1999.
[5] R. Chandra, S. Devine, B. Verghese, A. Gupta, and M. Rosenblum, “Scheduling and Page Migration for Multiprocessor Compute Servers”. Proc. of the 6th ASPLOS, pp. 12-24, Oct. 1994.
[6] J. Chapin, S. Herrod, M. Rosenblum, and A. Gupta “Memory System Performance of UNIX on CC-NUMA Multiprocessors”. ACM Sigmetrics Conf. on Measurement and Modeling of Computer Systems, pp. 1-13, May 1995.
[7] A. Charlesworth, “The Sun Fireplan System Interconnect”. Proc. of the Conf. on High Performance Networking and Computing (SC01), November 2001, http://www.sc2001.org/papers/pap.pap150.pdf.
[8] A. L. Cox and R. J. Fowler, “Adaptive Cache Coherency for Detecting Migratory Shared Data”. Proc. 20th Int.l Symp. on Computer Architecture, San Diego, California, pp. 98-108, May 1993.
[9] D. E. Culler and J. P. Singh, Parallel Computer Architecture: a Hardware/Software Approach. Morgan Kaufmann Publishers, San Francisco, CA, 1998.
[10] Z. Cvetanovic and D. Bhandarkar, “Characterization of Alpha AXP Performance Using TP and SPEC Workloads”. Proc. 21st Int.l Symp. on Computer Architecture, pp. 60-70, Apr. 1994.
[11] M. Dubois, J. Skeppstedt, L. Ricciulli, K. Ramamurthy, and P. Stenström, “The Detection and Elimination of Useless Miss in Multiprocessor”. Proc. of 20th Int.l Symp. on Computer Architecture, San Diego, CA, pp. 88-97, May 1993.
[12] S. J. Eggers, T. E. Jeremiassen, “Eliminating False Sharing”. Proc. 1991 Int.l Conf. on Parallel Processing, August 1991, pp. I:377-381.
[13] S.J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm, and Dean M. Tullsen, “Simultaneous Multithreading: A Platform for Next-Generation Processors”. IEEE Micro, pp. 12-19, vol. 17, no. 5, Oct. 1997.
[14] P. Foglia, “An Algorithm for the Classification of Coherence Related Overhead in Shared-Bus Shared-Memory Multiprocessors”. IEEE TCCA Newsletter, pp. 40-46, Jan. 2001.
[15] K. Gharachorloo, A. Gupta, and J. Hennessy, “Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors”. Proc. of the Fourth ASPLOS, Santa Clara, California, pp. 245-357, Apr. 1991.
[16] S. J. Eggers, R. H. Katz, “Evaluating the performance of four snooping cache coherency protocols “. Proc. of 16th Annual International Symposium on Computer Architecture, pp. 2-15, Jerusalem, Israel, 1989.
[17] R. Giorgi, C. Prete, G. Prina, and L. Ricciardi, “Trace Factory: a Workload Generation Environment for Trace-Driven Simulation of Shared-Bus Multiprocessor”. IEEE Concurrency, vol. 5, no. 4, pp. 54-68, October – Dec. 1997.
[18] R. Giorgi and C.A. Prete, “PSCR: A Coherence Protocol for Eliminating Passive Sharing in Shared-Bus Shared-Memory Multiprocessors”. IEEE Trans. on Parallel and Distributed Systems, pp. 742-763, vol. 10, no. 7, July 1999.
[19] S. J. Eggers, R. H. Katz, “A Characterization of Sharing in Parallel Programs and Its Application to Coherency Protocol Evaluation”. Proc. of 15th Annual Int. Symp. on Computer Architecture, pp. 373-382, Honolulu, HI, May 1988.
[20] A. M. Grizzaffi Maynard, C. M. Donnelly, and B. R. Olszewski, “Contrasting characteristics and cache performance of technical and multi-user commercial workloads”. Proc. of the 6th ASPLOS, pp. 158-170, Oct. 1994.
[21] E. Hagersten and M. Koster, “WildFire: A Scalable Path for SMPs”. Proc. of the 5th Symp. on High-Performance Computer Architecture, pp. 172-181, Jan. 1999.
[22] J. Hennessy and D.A. Patterson, Computer Architecture: a Quantitative Approach, 3rd edition. Morgan Kaufmann Publishers, San Francisco, CA, 2002.
[23] R. L. Hyde and B. D. Fleisch, “An Analysis of Degenerate Sharing and False Coherence”. Journal of Parallel and Distributed Computing, vol. 34(2), pp. 183-195, May 1996.
[24] T. E. Jeremiassen, S. J. Eggers, “Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations”. ACM SIGPLAN Notices, 30 (8), pp. 179-188, Aug. 1995.
[25] K. Keeton, D. Patterson, Y. He, R. Raphael, and W. Baker, “Performance characterization of a quad Pentium Pro SMP using OLTP workloads”. Proc. of the 25th Int.l Symp. on Computer Architecture, pp. 15-26, June 1998.
[26] D. Kroft, “Lockup-free Instruction Fetch/Prefetch Cache Organization”. Proc. of the 8th Int.l Symp. on Computer Architecture, pp. 81-87, June 1981.
[27] Jack L. Lo, Luiz A. Barroso, Susan J. Eggers, Kourosh G. Gharachorloo, Henry M. Levy, Sujay S. Pareck, “An analysis of Database Workload Performance on Simultaneous Multithreaded Processors”. Proc. of the 25th Annual Int.l Symp. on Computer Architecture, pp. 39-50, Barcelona, Spain, June 1998.
[28] T. Lovett, R. Clapp, “StiNG: A CC-NUMA Computer System for the Commercial MarketPlace”. Proc. of the 23rd Int.l Symp. on Computer Architecture, pp. 308-317, May 1996.
[29] C.A. Prete, G. Prina, and L. Ricciardi, “A Trace Driven Simulator for Performance Evaluation of Cache-Based Multiprocessor System”. IEEE Transactions on Parallel and Distributed Systems, vol. 6 (9), pp. 915-929, Sept. 1995.
[30] C. A. Prete, G. Prina, R. Giorgi, and L. Ricciardi, “Some Considerations About Passive Sharing in Shared-Memory Multiprocessors”. IEEE TCCA Newsletter, pp. 34-40, Mar. 1997.
[31] P. Ranganathan, K. Gharachorloo, S. V. Adve, and L. Barroso, “Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors”. Proc. of the 8th ASPLOS, pp. 307-318, San Jose, CA, 1998.
[32] T. Shanley and Mindshare Inc., Pentium Pro and Pentium II System Architecture. Addison Wesley, Reading, MA, 1999.
[33] P. Stenström, M. Brorsson, and L. Sandberg, “An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing”. Proc. of the 20th Annual Int.l Symp. on Computer Architecture, May 1993.
[34] P. Stenström, E. Hagersten, D.J. Li, M. Martonosi, and M. Venugopal, “Trends in Shared Memory Multiprocessing”. IEEE Computer, vol. 30, no. 12, pp. 44-50, Dec. 1997.
[35] B. Stunkel, B. Janssens, K. Fuchs, “Address Tracing for Parallel Machines”. IEEE Computer, vol. 24 (1), pp. 31-45, Jan. 1991.
[36] P. Sweazey and A. J. Smith, “A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus”. Proc. of the 13th Int.l Symp. on Computer Architecture, pp. 414-423, June 1986.
[37] M. Tomasevic and V. Milutinovic, “The Cache Coherence Problem in Shared-Memory Multiprocessors –Hardware Solutions”. IEEE Computer Society Press, Los Alamitos, CA, Apr. 1993.
[38] M. Tomasevic and V. Milutinovic, “Hardware Approaches to Cache Coherence in Shared-Memory Multiprocessors”. IEEE Micro, vol. 14, no. 5, pp. 52-59, Oct. 1994 and vol. 14, no. 6, pp. 61-66, Dec.1994.
[39] M. Tomasevic and V. Milutinovic, “The Word-Invalidate Cache Coherence Protocol”. Microprocessors and Microsystems, pp. 3-16, vol. 20, Mar. 1996.
[40] J. Torrellas, M. S. Lam, and J. L. Hennessy, “Share Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates”. Proc. of 1990 Int.l Conf. on Parallel Processing, Urbana, IL, pp. 266-270, Aug. 1990.
[41] J. Torrellas, A. Gupta, and J. Hennessy, “Characterizing the caching and synchronization performance of a multiprocessor operating system”. Proc. 5th ASPLOS, pp. 162-174, Sept. 1992.
[42] J. Torrellas, M. S. Lam, and J.L. Hennessy, “False Sharing and Spatial Locality in Multiprocessor Caches”. IEEE Transactions on Computer, vol. 43, n. 6, pp. 651-663, June 1994.
[43] Transaction Processing Performance Council, “TPC Benchmark B (Online Transaction Processing) Standard Specification”. June 1994. http://www.tpc.org.
[44] Transaction Processing Performance Council, “TPC Benchmark D (Decision Support) Standard Specification”. December 1995. http://www.tpc.org.
[45] P. Trancoso, J. L. Larriba-Pey, Z. Zhang, and J. Torrellas, “The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors”. Proc. of the 3rd Int.l Symp. on High Performance Computer Architecture, pp. 250-260, Los Alamitos, CA, Feb. 1997.
[46] R. Uhlig, T. Mudge, “Trace-Driven Memory Simulation: a Survey”. ACM Computing Surveys, pp. 128-170, June 1997.
[47] C. A. Prete, G. Prina, R. Giorgi, L. Ricciardi, “Some Considerations About Passive Sharing in Shared-Memory Multiprocessors”. IEEE TCCA Newsletter, March 1997.
[48] K. C. Yeager, “The MIPS R10000 Superscalar Microprocessor”. IEEE Micro, vol. 16, n. 4, pp. 42-50, Aug. 1996.
[49] A. Yu and J. Chen, “The POSTGRES95 User Manual”. Computer Science Div., Dept. of EECS, UCB, July 1995.
[50] K. Keeton, R. Clapp, A. Nanda, “Evaluating Servers with Commercial Workloads”. IEEE Computer, Vol. 36, N. 2, pp. 29-32, Feb. 2003.
[51] A. S. Tanenbaum, Structured Computer Organization, 4th edition. Prentice-Hall, Inc., 2001.
[52] R. Yu, L. Bhuyan, R. Iyer, “Comparing the Memory System Performance of DSS Workloads on the HP V-Class and SGI Origin 2000”. Proc. of the Int. Parallel and Distributed Processing Symposium, Fort Lauderdale, FL, April 2002.
[53] C. Ballinger, “Relevance of the TPC-D Benchmark Queries: The Questions You Ask Every Day”. NCR Parallel Systems, http://www.tpc.org/information/other/articles/TPCDart_0197.asp.
[54] A. Ramírez, J.L. Larriba, C. Navarro, X. Serrano, J. Torrellas, and M. Valero, “Code Reordering of Decision Support Systems for Optimized Instruction Fetch”. IEEE International Conference on Parallel Processing, Japan, Sept. 1999.
[55] P. Foglia, R. Giorgi, C.A. Prete, “Analysis of Sharing Overhead in Shared Memory Multiprocessors”. 31st IEEE Hawaii Int. Conf. on Systems, Vol. 7, pp. 776-778, Kohala Coast, HI, January 1998.
[56] S. J. Eggers, R. H. Katz, “The Effect of Sharing on the Cache and Bus Performance of Parallel Programs”. Proc. 3rd ASPLOS, pp. 257-270, Boston, MA, April 1989.
Toward Optimization of OpenMP Codes for Synchronization
and Data Reuse
Tien-hsiung Weng and Barbara Chapman
Computer Science Department
University of Houston Houston, Texas 77024-3010
thweng, [email protected]
Abstract
In this paper, we present the compiler transformation of OpenMP code to an ordered collection of tasks, and the compile-time as well as runtime mapping of the resulting task graph to threads for data reuse. The ordering of tasks is relaxed where possible so that the code may be executed in a more loosely synchronous fashion. Our current implementation uses a runtime system that permits tasks to begin execution as soon as their predecessors have completed. A comparison of the performance of two example programs in their original OpenMP form and in the code form resulting from our translation is encouraging.
1 INTRODUCTION
OpenMP [1] succeeds as a popular parallel programming interface for medium scale high performance applications on shared memory systems due to its portability, ease-of-use and support for incremental parallelization. Writing OpenMP programs is relatively easy, in contrast to other parallel programming models such as MPI. Users may achieve reasonable performance by adding just a few OpenMP directives. However, it requires a non-trivial effort to obtain scalable OpenMP codes for ccNUMA systems. In order to make OpenMP a parallel programming model for a system consisting of a large number of processors, synchronization in the code must be minimized. An additional potential cause of performance degradation on ccNUMA (cache coherent Non-Uniform Memory Access) systems is that a thread may need to access a remote memory location; despite substantial progress in interconnection systems, this still incurs significantly higher latency than accesses to local memory, and the cost may be exacerbated by network contention. If data and threads are unfavorably mapped, cache lines may ping-pong between locations. However, OpenMP provides few features for managing data locality. In particular, it does not enable the explicit binding of parallel loop iterations to a processor executing a given thread; such a binding might allow the distribution of work such that threads can reuse data.
An alternative approach that can provide high performance on ccNUMA platforms is for the user to rewrite the OpenMP code, decomposing the data structures into portions that will be private to each thread, creating buffers to hold shared data and arranging for the appropriate synchronization. Yet this so-called Single Program Multiple Data (SPMD) OpenMP style requires considerable reprogramming. We attempt to relieve users from the extra burden of creating SPMD code and to help them achieve good performance for their original programs by transforming standard OpenMP codes to tasks, relaxing synchronization between tasks where possible, and mapping them to processors for data reuse.
2 OUR APPROACH
We target OpenMP to a data flow execution model realized using the SMARTS runtime system [7][8] developed at Los Alamos National Laboratory. At runtime, SMARTS employs a master thread to orchestrate the work of slave threads that execute the tasks. The tasks, a graph that represents the ordering of their execution, and the data read and written by each task are passed to the runtime system. Before a task can run, it must make read/write requests for the data it needs. As soon as the data are available, the master thread schedules the task to a slave queue for execution. The slave thread may be chosen by the compiler or by the runtime system. SMARTS also supports work stealing, where a slave thread takes one or more tasks from other queues.
The compiler transforms an OpenMP code to a collection of sequential tasks and a task graph, which represents the dependences between the tasks. This translation is based heavily on array region analysis, which must be applied interprocedurally. One of our goals is to perform as much work as possible at compile time in order to avoid expensive runtime computation of dependences; where necessary, code will be generated to complete the task graph and finalize the mapping to processors at run time. The performance of the resulting code will be determined by the suitability of the code for this approach, the quality of the translation, and the amount of data reuse by each processor, which must compensate for the overheads incurred by the SMARTS runtime system. These overheads include managing requests for read and write data, as well as the scheduling of tasks into task queues by SMARTS.
The compiler may specify the mapping of tasks to processors (or equivalently, to threads that are bound to processors) or may leave this to the runtime system. Finding a good compile-time mapping of nodes of the task graph to the processors for data locality reuse is the main focus of this paper. We use this mapping when we generate parallel code that includes calls to SMARTS. The resulting code is executed using the so-called first touch policy, where data will be stored so that it is local to the task that first accesses it. All tasks that use this data will be assigned to the same thread wherever possible to maximize data locality. We describe and discuss the mapping algorithm that we have developed below.
This paper is organized as follows. In the next section, we first discuss the generation of tasks and the task graph. We then present the algorithm for scheduling the tasks in the graph to threads for data locality reuse. Our strategy for code generation is also discussed. Then, in Section 4, we illustrate our work using an LU kernel. We also use an ADI code example to illustrate how our mapping algorithm enables macro-pipelined computations. Then, experimental results are presented. Finally, Section 5 presents related work and Section 6 our conclusions and future work.
3 CREATION AND MAPPING OF THE TASK GRAPH
For the purposes of this paper, we call a unit of sequential code that will be executed by a single thread a task. Such a task is an arbitrary code sequence, such as a set of iterations of a loop nest, or one or more procedures, without synchronization constructs. An OpenMP program will be decomposed into an ordered collection of tasks according to the semantics of the language. The ordering will be represented by the task graph. A task graph for an OpenMP program, denoted by G(N,E), consists of a set of nodes N = {t1, t2, ..., tm}, where each node represents a task in the decomposed program, and a set of edges E between nodes, where ei,j is an edge from node ti to node tj. Each edge represents synchronization in the sense that the source node must be executed before the sink node at run time. As part of our translation strategy, we will eliminate as many edges as possible from this graph in order to reduce the amount of synchronization imposed on the executing code. We will also use this graph and our analysis to map the tasks in the program to threads. Note that we bind exactly one thread to each target processor, so that we may refer to threads and processors interchangeably. We assume that processors (and threads) are numbered consecutively, beginning with 1.
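As a concrete illustration of this representation (a hypothetical sketch, not the authors' data structures), the task graph G(N,E) can be held as a node table plus a set of directed edges; the example below instantiates the tasks derived from the code of Figure 2.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One unit of sequential code executed by a single thread.
    Read/write accesses are kept as sets of array-section descriptors,
    here simplified to hashable tuples (name, lo1, up1, lo2, up2)."""
    tid: int
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)

@dataclass
class TaskGraph:
    """G(N, E): nodes are tasks; edge (i, j) means task i must
    complete before task j may start."""
    nodes: dict = field(default_factory=dict)   # tid -> Task
    edges: set = field(default_factory=set)     # {(i, j), ...}

    def add_task(self, task):
        self.nodes[task.tid] = task

    def add_edge(self, i, j):
        self.edges.add((i, j))

    def predecessors(self, j):
        return [i for (i, k) in self.edges if k == j]

# Tasks t9 and t10 (the two halves of the parallel loop of Figure 2)
# and t11 (the SINGLE construct), with their write sections:
g = TaskGraph()
g.add_task(Task(9, writes={("A", 1, 100, 1, 50)}))
g.add_task(Task(10, writes={("A", 1, 100, 51, 100)}))
g.add_task(Task(11, writes={("A", 1, 1, 1, 1)}))
g.add_edge(9, 11)
g.add_edge(10, 11)
```

Any richer per-task payload (code segment, loop bounds, synchronization constraints) would hang off the same node records.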
3.1 Generation of tasks and task graph
OpenMP does not restrict work-sharing constructs to be within the lexical scope of the parallel region; they can occur within a subroutine or a procedure that is invoked, either directly or indirectly from inside a parallel region. Parallel regions may contain long call chains. Thus the work required to create tasks is of necessity interprocedural. We have developed an efficient call graph construction algorithm [34] to support recursive calls and dynamically bound calls arising from invocations of procedure-valued formal parameters. It provides a precise traversal of the generated call graph, since additional traversal information is saved. It has been implemented in our Dragon tool [36] that is based on Open64 [35]. The traversal algorithm is used to create task identification and gather array section information in an exact call chain of the program call graph. For each task, we collect the following information: 1) read/write access to array sections; 2) the code segment; 3) information on (possible) loop iterations; 4) synchronization constraints.
We follow the semantics of OpenMP worksharing constructs (DO, SECTIONS, MASTER, and SINGLE, etc.) to generate the sequential tasks. For example, the iterations of a parallel loop with a default schedule are assigned to threads so that each has a contiguous set of iterations of approximately the same size. If the loop does not contain any further worksharing or synchronization constructs, each such set of iterations would be a task in our program representation. A SINGLE or MASTER construct each represents a task, while a CRITICAL SECTION will be translated to one task per thread, each with the same code segment. Sequential regions of a code may be divided into basic blocks, each of which may form a task. The complete sequence of tasks of a program is generated interprocedurally in a traversal of our call graph. A typical OpenMP program will contain barrier synchronizations, and possibly atomic updates or critical sections to ensure ordered access to shared data. These synchronizations will be represented explicitly in the task graph that is also constructed. However, many of the resulting edges are in fact superfluous in this representation. For instance, an OpenMP code may require a barrier between two parallel loop nests, although the data reuse between them only occurs between two of the resulting tasks. All other tasks can be executed asynchronously.
In order to determine which edges can be eliminated from the graph, the compiler performs an array section analysis to determine as accurately as possible each task’s write and read accesses to shared data objects in the program. If we can use this analysis to prove that a pair of tasks that are connected by an edge do not access the same data (or such accesses are reads only), then we may remove the corresponding edge from the task graph.
We have chosen to use standard regular sections, or the triplet notation, to represent the data read and written, because of the efficiency with which we can compare them ([15], [9]); however, more accurate analyses have been proposed, e.g., simple sections [10], regions [11][12][13], and the Omega test [17], and these are potential alternatives for compile-time analysis when precision is critical. Efficiency and practicality are key for runtime synchronization analysis, and regular sections are the natural choice for our runtime strategy, where we must avoid overheads.
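A minimal sketch of the regular-section comparison (names and representation are ours): each array dimension is described by lower and upper bounds, and two sections are disjoint as soon as any dimension fails to overlap. Ignoring strides, as below, is conservative; it may report overlap where a stride-aware test would prove disjointness.

```python
def ranges_overlap(lo1, up1, lo2, up2):
    """True if the integer ranges [lo1, up1] and [lo2, up2] intersect."""
    return max(lo1, lo2) <= min(up1, up2)

def sections_overlap(sec1, sec2):
    """sec1, sec2: lists of (lower, upper) bounds, one pair per array
    dimension. Sections are disjoint as soon as one dimension is."""
    return all(ranges_overlap(l1, u1, l2, u2)
               for (l1, u1), (l2, u2) in zip(sec1, sec2))

# A[1:100, 1:50] vs A[1:100, 51:100] (tasks t9 and t10): disjoint columns.
assert not sections_overlap([(1, 100), (1, 50)], [(1, 100), (51, 100)])
# A[1:100, 1:50] vs A[1:1, 1:1] (task t11 writes A(1,1)): overlap.
assert sections_overlap([(1, 100), (1, 50)], [(1, 1), (1, 1)])
```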
In order to improve the accuracy of our analysis, we use a delayed union approach when determining the region of an array that is accessed by a thread. Under this approach, two referenced sections of an array are merged if the merge does not lead to a loss of accuracy. Otherwise, each individual section is kept in a list. If a new reference cannot be merged with an existing regular section, it is added to the list until a threshold number of entries is achieved, in which case union will be enforced. This method may not improve accuracy in many cases.
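The delayed-union policy can be sketched in one dimension as follows (our illustration; the lossless-merge criterion and the threshold value are simplifications of what a real implementation would use):

```python
def exact_union(a, b):
    """Merge two 1-D sections (lo, up) only if their union is still a
    single contiguous range, i.e. the merge loses no accuracy.
    Returns the merged section, or None if merging would be lossy."""
    (lo1, up1), (lo2, up2) = a, b
    if max(lo1, lo2) <= min(up1, up2) + 1:   # overlapping or adjacent
        return (min(lo1, lo2), max(up1, up2))
    return None

def add_reference(sections, ref, threshold=4):
    """Delayed union: try a lossless merge first; otherwise append the
    new section to the list. Past the threshold, force a (lossy)
    bounding union of everything."""
    for i, sec in enumerate(sections):
        merged = exact_union(sec, ref)
        if merged is not None:
            sections[i] = merged
            return sections
    sections.append(ref)
    if len(sections) > threshold:
        lo = min(s[0] for s in sections)
        up = max(s[1] for s in sections)
        sections = [(lo, up)]
    return sections

secs = add_reference([(1, 50)], (51, 100))   # adjacent: merges losslessly
assert secs == [(1, 100)]
secs = add_reference([(1, 50)], (200, 300))  # disjoint: kept separately
assert secs == [(1, 50), (200, 300)]
```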
The need for synchronization is determined by comparing the data written by a task with the data read and written by a task to which it is linked via an edge in the task graph. Where compile-time analysis can prove that there is no common data, the edge is deleted from the graph. Where results are imprecise, runtime tests may be inserted to determine the need for an edge during program execution. It is worthwhile to do so when the test is fast, such as checking the range of a specific constant value.
When the number of threads is known in advance, task generation is straightforward. Otherwise, more work is needed at runtime. In the former case, each task is created, ordering relationships are determined based upon OpenMP semantics, and the data read and written by each task are computed. Task sequences and their read/write accesses to regular sections of arrays are also obtained. For example, we present the algorithm for translating the DO construct in Figure 1.
To illustrate the above algorithm, we use the code in Figure 2 and assume that NT is 2. Tasks will be created by subdividing the work in the outer loop according to the default schedule. If eight tasks have already been created when this loop is encountered, then it will generate two new tasks, t9 and t10, to perform the work in the loop. The first of these tasks will perform the first 50 iterations and the second will perform the remaining 50. Accordingly, task t9 has bounds I=1,50 and J=1,100 and will write the array section A[1:100, 1:50]. Task t10 has bounds I=51,100 and J=1,100 and writes to A[1:100, 51:100]. These tasks must both wait until all tasks involved in the previous construct have been completed, unless the NOWAIT clause was specified there.
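The default-schedule split that produces t9 and t10 can be sketched as follows (a hypothetical helper; OpenMP implementations may differ in how they distribute a remainder):

```python
def default_schedule_bounds(n_iters, n_threads):
    """Contiguous, approximately equal chunks of the iteration space
    1..n_iters, one per thread, as under the OpenMP default (static)
    schedule. Returns a list of (lower, upper) bounds."""
    base, extra = divmod(n_iters, n_threads)
    bounds, lo = [], 1
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)  # spread the remainder
        bounds.append((lo, lo + size - 1))
        lo += size
    return bounds

# The loop of Figure 2: 100 iterations, NT = 2 -> tasks t9 and t10.
assert default_schedule_bounds(100, 2) == [(1, 50), (51, 100)]
```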
Each of the SINGLE, MASTER and SECTION constructs corresponds to one task. So for the code in Figure 2, task t11 will be created to execute the code within the SINGLE construct that has write access to array section A[1:1,1:1]. Both t9 and t10 will be predecessors of t11. By applying
Figure 1 The generation of tasks from an OpenMP parallel loop

Let NT be the number of threads and tk the last task before this loop
DO I = 1, NT
    k = k + 1
    Compute bounds of tk according to loop schedule
    Compute array accesses by task tk
    Add tk to N
ENDDO
Figure 2 Example of DO

!$OMP PARALLEL
…
!$OMP DO
DO I = 1, 100
    DO J = 1, 100
        A(J,I) =
    ENDDO
ENDDO
!$OMP SINGLE
A(1,1) =
!$OMP END SINGLE
!$OMP END PARALLEL
For each ei,j in E do
    If (ti.RWaccess ∩ tj.RWaccess) = φ then
        Remove ei,j from E
        For each tk ∈ pr(ti) do
            If (tk.RWaccess ∩ tj.RWaccess) ≠ φ then
                Add ek,j to E
                Compute ek,j.reuseLevel
            Endif
        Endfor
    Else
        Compute ei,j.reuseLevel
    Endif
Endfor

Figure 3 Algorithm for eliminating edges from task graph
the algorithm of Figure 3, we can remove edge e10,11.
Let there be an edge from ti to tj in the task graph, and let Ri be the data read by task ti and Wi be the data it writes. As already described, we must compare the data written in each with the data read and written in the other in order to determine whether we can remove the edge. If we can do so, we must compare the data accessed by task tj with that accessed by all predecessors of task ti in the graph. For the purpose of scheduling, we are also interested in situations where two tasks read the same data. This information will also be recorded in the form of a special edge.
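Rendered as executable code, the edge-elimination step of Figure 3 might look like the following sketch (the data structures and the element-counting reuseLevel are our simplifications; the compiler compares regular sections rather than explicit element sets):

```python
def eliminate_edges(rw, edges, reuse=None):
    """rw: task id -> set of accessed data items (RWaccess).
    edges: set of (i, j) ordering edges, mutated in place.
    Removes edges whose endpoints share no data; when edge (i, j) is
    removed, edges are added from predecessors of i that do share data
    with j, so no required ordering is lost. Surviving and newly added
    edges get a reuse level (here: the count of shared items)."""
    if reuse is None:
        reuse = {}
    for (i, j) in sorted(edges):          # snapshot before mutation
        if not rw[i] & rw[j]:
            edges.discard((i, j))
            for k in [a for (a, b) in edges if b == i]:
                if rw[k] & rw[j]:
                    edges.add((k, j))
                    reuse[(k, j)] = len(rw[k] & rw[j])
        else:
            reuse[(i, j)] = len(rw[i] & rw[j])
    return edges, reuse

# t9 writes A(1,1) among other elements; t10 does not touch A(1,1);
# t11 writes only A(1,1). Edge e10,11 is removed, as in the text.
rw = {9: {"A11", "Aleft"}, 10: {"Aright"}, 11: {"A11"}}
edges, reuse = eliminate_edges(rw, {(9, 11), (10, 11)})
assert edges == {(9, 11)}
```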
If there is an edge between nodes ti and tj in the modified task graph, the compiler must further determine the level of data reuse between the two tasks; the result will be appended to the edge and denoted by ei,j.reuseLevel. The reuse level is determined by the amount of data accessed by both the source and sink tasks, and by the type of dependence between them. For example, if both tasks write the data, the reuse level will be higher than if both read the same data. The execution time of each task ti in the task graph, w(ti), is statically estimated from the number and type of operations and operands, available information on loop bounds, etc. An alternative would be to provide a profile option. We may also be able to provide better estimates at run time, when symbolic bounds etc. can be evaluated.
Note that tasks may be subdivided to improve their cache behavior as well as to isolate iterations that give rise to synchronization, and that reordering of computation within a task may support the former but not the latter. Hence, techniques to determine appropriate granularity and to move constructs causing synchronization are required. This topic is not discussed in this paper.
3.2 Mapping algorithm for data locality reuse
We use a heuristic algorithm to map tasks in the task graph to processors (Figure 4). It is based upon a list scheduling approach [33]. The goal of the algorithm is to map tasks that use the same data to the same processor. However, at each step of the way, it must decide which one is the next task to map: it maps “more important” tasks first when possible, where this is determined by the weight of a node.
The algorithm consists of an initialization and an iterative step. In the initialization step, the algorithm first computes the level of each task ti by determining the sum of the weights on paths from ti to the terminal node, taking both edge and task weights into account. It sets level(tj) to the greatest weight on such a path. It also computes #ipr(tj), the number of immediate predecessors of tj. The function Procs(ti) = pk records mapping decisions, i.e., task ti is mapped to processor pk; initially, it is set to NULL.
At any stage during the process, the ready queue will hold those tasks that are ready for mapping but not yet assigned to processors. The algorithm initially puts all tasks without any predecessors in the task graph (“initial tasks”) into the ready queue (RQ) and orders them based upon their level; then it selects the highest priority (highest level) task tcurrent from RQ as the next ready task, assigns it to processor 1, and removes tcurrent from RQ.
The iterative step successively selects tasks and allocates them. In step 2.1, the algorithm will select the most important of the immediate successors of tcurrent by examining those for which the data reuse between it and tcurrent is highest. If this task is subsequently chosen for mapping, it will be allocated to the same processor as tcurrent. It uses the two functions list_max_isuccs() and max_ipred() to do so. The function list_max_isuccs() will return a list of those immediate successors of tcurrent for which the highest reuse level is to be found on the edge from tcurrent. In other words, suppose ts is one of tcurrent’s immediate successors with two incoming edges, from tcurrent and from some other task tk. If ek,s.reuseLevel > ecurrent,s.reuseLevel, then ts will not be included in the list returned by this function. The function max_ipred() selects the task from those returned by list_max_isuccs() that corresponds to the edge from tcurrent with the highest reuse level. The chosen task, ttemp, will be temporarily mapped to the processor that executes tcurrent, and its incoming edge from tcurrent is marked as visited. The visited edges only affect list_max_isuccs() and max_ipred().
We next compare the priority (weight) of the chosen successor task with that of the highest priority task in the ready queue. Whichever one has the highest priority will be mapped next; we call it tnext. Before continuing, we now add to RQ any tasks all of whose predecessors have already either been assigned to processors or placed in RQ. In step 4, we remove the next task from RQ, if necessary, and fix its mapping. If it is an immediate successor of tcurrent, then it will be mapped to the same processor. Otherwise, we call the function Locate_Processor to check whether it has been temporarily mapped to a processor; if it has, this function will map it to that processor. Otherwise, the function will find an idle processor for the mapping. Finally, we set tnext to be the new tcurrent before beginning the next iteration.
Input: Task graph G(N,E) and m processors of a distributed shared memory system
Output: procs(ti) = pk for each task ti, the assignment of tasks to processors
Begin
1. Initialization step:
   1.1 For each task ti in G do
       1.1.1 Compute level(ti) and #ipr(ti)
       1.1.2 Initialize procs(ti) to NULL
   1.2 Add all initial tasks to RQ (ready queue), ordered by level number
   1.3 Get the highest-priority task in RQ, set it to tcurrent, and remove it from RQ
       procs(tcurrent) = p1
Repeat
2. Select next task for mapping:
   2.1 Select the best immediate successor of tcurrent:
       If tcurrent has an immediate successor then
           ttemp = max_ipred(list_max_isuccs(tcurrent))
           Temporarily map ttemp to procs(tcurrent)
           mark_visited(inedges(ttemp))
       else
           ttemp = NULL
       Endif
   2.2 Compare priority with tasks in RQ and select tnext:
       tnext = maxweight(ttemp, RQ)   /* task with maximum priority among ttemp and RQ */
3. Process successors of tcurrent:
   3.1 For each ti ∈ isuccs(tcurrent) do
           #ipr(ti) = #ipr(ti) - 1
           If #ipr(ti) == 0 then
               Add ti to RQ in priority order
           Endif
       Endfor
4. Map next task:
   4.1 If tnext is in RQ then
           Remove tnext from RQ
           If tnext is an immediate successor of tcurrent then
               procs(tnext) = procs(tcurrent)
           else
               Locate_processor(tnext, PL); procs(tnext) = procs(PL)
           Endif
           If not visited(inedge(tnext)) then
               mark_visited(inedge(tnext))
           Endif
       Endif
5. Update current task: tcurrent = tnext
Until all tasks in G are assigned
End
Figure 4 Data reuse mapping algorithm
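As a concrete illustration, the algorithm of Figure 4 can be rendered in executable form. The sketch below is our own rendering, not the compiler's implementation: the node and edge weights are read from the Figure 7 example, and level(t) is assumed to be the weight of the heaviest reuse path from t to a leaf, an assumption that reproduces the levels 81, 101, 1, 1, 1 quoted in Section 4.

```python
import heapq

# Node and edge weights from the Figure 7 example (our reading of the figure).
WEIGHTS = {'t1': 1, 't2': 1, 't3': 1, 't4': 1, 't5': 1}
EDGES = {('t1', 't3'): 70, ('t1', 't4'): 80, ('t1', 't5'): 20,
         ('t2', 't4'): 100, ('t2', 't5'): 20}

SUCC = {t: [v for (u, v) in EDGES if u == t] for t in WEIGHTS}
PRED = {t: [u for (u, v) in EDGES if v == t] for t in WEIGHTS}

def level(t):
    # Assumption: a task's level is the heaviest reuse path to a leaf;
    # this reproduces the levels 81, 101, 1, 1, 1 quoted in the text.
    if not SUCC[t]:
        return WEIGHTS[t]
    return max(EDGES[(t, s)] + level(s) for s in SUCC[t])

def map_tasks():
    procs, temp, visited = {}, {}, set()
    ipr = {t: len(PRED[t]) for t in WEIGHTS}
    rq = [(-level(t), t) for t in WEIGHTS if ipr[t] == 0]   # ready queue
    heapq.heapify(rq)
    _, tcur = heapq.heappop(rq)
    procs[tcur], next_idle = 0, 1
    while len(procs) < len(WEIGHTS):
        # Step 2.1: best immediate successor of tcur, among those whose
        # highest-reuse unvisited incoming edge comes from tcur.
        cands = [s for s in SUCC[tcur]
                 if s not in procs and (tcur, s) not in visited
                 and EDGES[(tcur, s)] == max(EDGES[(p, s)] for p in PRED[s]
                                             if (p, s) not in visited)]
        ttemp = max(cands, key=lambda s: EDGES[(tcur, s)], default=None)
        if ttemp is not None:
            temp[ttemp] = procs[tcur]                        # temporary mapping
            visited.update((p, ttemp) for p in PRED[ttemp])
        # Step 2.2: the higher-priority of ttemp and the head of the ready queue.
        top = rq[0][1] if rq else None
        if ttemp is not None and (top is None or level(ttemp) >= level(top)):
            tnext = ttemp
        else:
            tnext = top
        # Step 3: release successors whose predecessors have all been processed.
        for s in SUCC[tcur]:
            ipr[s] -= 1
            if ipr[s] == 0:
                heapq.heappush(rq, (-level(s), s))
        # Step 4: fix the mapping of tnext.
        if any(t == tnext for _, t in rq):
            rq = [(w, t) for (w, t) in rq if t != tnext]
            heapq.heapify(rq)
            if tnext in SUCC[tcur]:
                procs[tnext] = procs[tcur]
            else:
                procs[tnext] = temp.get(tnext, next_idle)    # Locate_processor
                next_idle += procs[tnext] == next_idle
        else:
            procs[tnext] = procs[tcur]                       # ttemp stays with tcur
        tcur = tnext
    return procs

print(map_tasks())   # {'t2': 0, 't1': 1, 't3': 1, 't4': 0, 't5': 2}
```

On the Figure 7 graph this reproduces the mapping derived in Section 4: t2 and t4 on p1 (index 0), t1 and t3 on p2, and t5 on p3.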
3.3 Code Generation
3.3.1 Static case
In the static case, the compiler collects as much information as possible in order to avoid runtime overheads and to generate code that passes complete information to the runtime system. Such information includes the task identification, possibly the loop iterations, the task mapping, and read/write access information (or the task graph itself, if another runtime system is used). This is necessary so that tasks can be executed on the processors chosen for data reuse, and so that independent tasks can be executed out of order.
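For illustration only, the per-task record described above might take the following shape. The field names and layout are our assumptions, not the actual compiler/runtime interface; the access sections are those of task t11 from the LU example of Section 4.1.

```python
from dataclasses import dataclass, field

# Hypothetical shape of the per-task record described above; the field names
# are illustrative assumptions, not the actual compiler/runtime interface.
@dataclass
class TaskInfo:
    task_id: str                                   # task identification
    iterations: tuple = None                       # loop iterations, if a DO chunk
    processor: int = 0                             # compile-time mapping decision
    reads: list = field(default_factory=list)      # array sections read
    writes: list = field(default_factory=list)     # array sections written

# e.g. task t11 of the LU example: chunk j = 2 of the k = 1 DO construct
t11 = TaskInfo('task_loop0', iterations=(2, 2), processor=0,
               reads=[('A', (2, 5), (2, 2)), ('A', (2, 5), (1, 1)),
                      ('A', (1, 1), (2, 2))],
               writes=[('A', (2, 5), (2, 2))])
print(t11.task_id, t11.processor)   # task_loop0 0
```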
After compile-time task mapping has been performed, we can generate source code such as that in Figure 5, derived from the example of Figure 2. The main program is first executed by the master thread, which creates additional threads, each of which is bound to a processor and passed the information stated above. In Figure 6, task_loop0 and task_single are the routines for the tasks corresponding to the DO construct and the SINGLE construct of Figure 2, respectively.

call SM_concurrency(n)        ! create n-1 slave threads to execute tasks
call SM_define_array(info)    ! SMARTS initialization
schedtype = hardaffinity
call SM_startgen()
do i = 1, iter
  k = k + 1
  call SM_def_readwrite(task_loop0, k)
  call SM_getinfo(task_loop0, affin, datinfo)
  call SM_handoff(task_loop0, schedtype, procs(task_loop0,k), datinfo)
end do
k = k + 1
call SM_def_readwrite(task_single, k)
call SM_handoff(task_single, schedtype, procs(task_single,k), datinfo)
call SM_run()
call SM_end()

Figure 5 Main program using SMARTS

subroutine task_loop0(datinfo)
  use shared_DATA
  call SM_compute_bound(datinfo, lb1, ub1, lb2, ub2)
  do j = lb1, ub1
    do i = lb2, ub2
      A(i,j) = ...
    enddo
  enddo
end subroutine task_loop0

subroutine task_single(datinfo)
  use shared_DATA
  A(1,1) = ...
end subroutine task_single

Figure 6 Parallel loops using SMARTS

3.3.2 Dynamic case

Most application programs have a significant number of input-data-dependent unknowns, and the number of processors may also be unknown at compile time. Compile-time analyses can consequently produce symbolic expressions, despite the application of interprocedural constant propagation and other techniques for maximizing the available information about variables. Therefore, symbolic analysis is required to support compile-time analyses as well as to reduce the cost of runtime analysis. Several techniques have been proposed for evaluating symbolic expressions and for expressing information about program variables at arbitrary program points, as well as for aggressively simplifying systems of symbolic constraints, determining the relationships between symbolic expressions, and computing their lower and upper bounds.
Blume and Eigenmann [27] use abstract interpretation to extract information about variable ranges. Haghighat [25] presents a variety of symbolic analysis techniques based on abstract interpretation. Fahringer [26] presents a unified symbolic evaluation framework that statically determines the values of variables and symbolic expressions. He uses the Omega library [18] for simplifying first-order logic expressions and constraints.
We are working on the generation of a routine that computes, at run time, those array sections whose symbolic bounds could not be resolved at compile time. (At compile time, the symbolic analysis consists of aggressively simplifying the system of symbolic constraints, determining the relationships between symbolic expressions, and computing their lower and upper bounds.) In this case we may need to (partially) construct the task graph and perform the task mapping algorithm at run time.
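As a minimal sketch of one ingredient of such a routine, assume an expression has already been reduced to affine form c0 + Σ ci·vi (our simplification; real symbolic analysis handles far more general constraints). Its lower and upper bounds then follow directly from the variable ranges:

```python
# Sketch (not the paper's implementation): bounds of an affine expression
# c0 + sum(ci * vi) given a lower/upper range for each variable vi.
def affine_bounds(c0, terms, ranges):
    """terms: {var: coefficient}; ranges: {var: (lo, hi)}."""
    lo = hi = c0
    for v, c in terms.items():
        vlo, vhi = ranges[v]
        # A positive coefficient is minimized at the variable's lower bound,
        # a negative coefficient at its upper bound (and vice versa for hi).
        lo += c * (vlo if c >= 0 else vhi)
        hi += c * (vhi if c >= 0 else vlo)
    return lo, hi

# e.g. a section bound n - k + 1 with 1 <= k <= 100 and 1 <= n <= 100
print(affine_bounds(1, {'n': 1, 'k': -1}, {'n': (1, 100), 'k': (1, 100)}))
# → (-98, 100)
```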
4 EVALUATION AND EXPERIMENTAL RESULTS
To illustrate how the mapping algorithm of Figure 4 works, we present a simple example task graph in Figure 7. After step 1.1, the levels of tasks t1, t2, t3, t4, and t5 are 81, 101, 1, 1, and 1, respectively, and the numbers of their immediate predecessors (#ipr) are 0, 0, 1, 2, and 2, respectively. In step 1.2, the two independent tasks t1 and t2 are put into RQ, ordered by node level (priority). In step 1.3, task t2 is selected and removed from RQ as the ready task tcurrent, and is assigned to processor p1.
In step 2.1, t2 (as tcurrent) is passed as the input parameter to list_max_isuccs(); t2 has two immediate successors, t4 and t5. Task t4 has two incoming edges, e1,4 and e2,4, with weights of 80 and 100, respectively. Since e2,4.reuseLevel is the greater, e2,4 is in the list. Similarly, t5 has two incoming edges, e1,5 and e2,5, with equal weights, so e2,5 is also in the list. Thus list_max_isuccs(t2) returns (e2,4), (e2,5) as input to max_ipred(), and since e2,4.reuseLevel has the maximum value, it returns t4 as ttemp, which is temporarily mapped to p1. It then marks e2,4 and e1,4 as visited, indicated by the dotted lines in Figure 8; this information is needed for the internal workings of list_max_isuccs() and max_ipred(). In step 2.2, t4 is compared with task t1, which has the highest priority in RQ, and t1 is assigned to tnext. In step 3, #ipr(t4) and #ipr(t5) are decremented to 1. In step 4, t1 is removed from RQ. Task t1 is not an immediate successor of t2 and has not been temporarily mapped to any processor, so Locate_processor() returns an idle processor, p2. Next, t1 becomes tcurrent.
Figure 7 Example task graph

Figure 8 After marking edges as visited

Figure 9 Mapping of Fig. 7
In the next iteration of step 2.1, the best successor of t1 is t3, so ttemp is t3, which is temporarily mapped to p2, and e1,3 is marked visited. In step 2.2, t3 is compared with the maximum-priority task in RQ; since RQ is empty, t3 is assigned to tnext. Now in step 3, #ipr(t3), #ipr(t4), and #ipr(t5) are decremented to 0, so they are put into RQ. In step 4, since t3 is in RQ, it is removed. Because t3 is an immediate successor of t1, it is mapped to the processor that executes t1, which is p2. Then t3 becomes tcurrent.
In the next iteration of step 2.1, t3 does not have a successor, so ttemp is NULL. In step 2.2, tnext is t4. In step 4, t4 is in RQ and is thus removed; since t4 has been temporarily mapped, Locate_processor() maps it to p1. Then t4 becomes tcurrent. Finally, in the last iteration, t4 does not have a successor, so tnext is t5. In step 4.1, t5 is removed from RQ and mapped to p3 by Locate_processor(), since it has not been temporarily mapped. The result is shown in Figure 9.
In the following, we discuss two small codes, one an LU decomposition and the other an ADI iteration, that have been translated using our approach.
4.1 LU example code demonstrating data reuse
We present the OpenMP LU example in Figure 10 to illustrate the generation of tasks, the task graph, and the data locality reuse scheduling algorithm. A task tk is generated for each k from 1 to 5, corresponding to the SINGLE construct. When k is 1, the memory access pattern of task t1 involves a write access to section [2:5,1:1] of array A and a read reference to the regular section [1:5,1:1] of A, obtained as [2:5,1:1] ∪ [1:1,1:1]. Similarly, task t2 involves a write access to A[3:5,2:2] and a read access to [2:5,2:2]. All array sections in the example can be exactly represented when we further decompose a block of data into several columns. This requires a delayed union operation.
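The delayed union operation can be sketched as follows (a simplified model of our own, not the compiler's actual data structure): a section is kept as a list of rectangular regions, and two regions are combined only when their union is again exactly one rectangle.

```python
# Sketch of a "delayed union" over regular array sections. A section is a list
# of disjoint rectangles ((lo1, hi1), (lo2, hi2)); unions stay as lists unless
# two rectangles merge exactly into one rectangle (simplified: no re-merging
# cascade after a merge).

def mergeable(a, b):
    # Two rectangles merge exactly iff they agree in one dimension and are
    # adjacent or overlapping in the other.
    (a1, a2), (b1, b2) = a, b
    if a2 == b2 and b1[0] <= a1[1] + 1 and a1[0] <= b1[1] + 1:
        return ((min(a1[0], b1[0]), max(a1[1], b1[1])), a2)
    if a1 == b1 and b2[0] <= a2[1] + 1 and a2[0] <= b2[1] + 1:
        return (a1, (min(a2[0], b2[0]), max(a2[1], b2[1])))
    return None

def delayed_union(sections):
    result = []
    for s in sections:
        for i, r in enumerate(result):
            m = mergeable(r, s)
            if m is not None:
                result[i] = m
                break
        else:
            result.append(s)      # no exact merge found: keep the union delayed
    return result

# Access of t12: [2:5,3:3] ∪ [2:5,1:1] ∪ [1:1,3:3]
secs = [((2, 5), (3, 3)), ((2, 5), (1, 1)), ((1, 1), (3, 3))]
print(delayed_union(secs))   # [((1, 5), (3, 3)), ((2, 5), (1, 1))]
```

With this rule, the t12 access from the text is summarized to [1:5,3:3] ∪ [2:5,1:1], while a lossy direct union to [1:5,1:3] (which would create a false dependence on t2) is avoided.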
A collection of tasks tkc, where k = 1,…,5 and c is the chunk number, is generated from the OpenMP DO construct in the code of Figure 10. The number of such tasks depends on the loop size, the number of threads, and the schedule chosen; in the present example the schedule is static, the chunk size equals one, and the number of threads is 4. The access of t11 to array A is [2:5,2:2] ∪ [2:5,1:1] ∪ [1:1,2:2]. Our delayed union algorithm does not summarize [2:5,1:2] ∪ [1:1,2:2] into one representation. Similarly, the access to array A by t12 is [2:5,3:3] ∪ [2:5,1:1] ∪ [1:1,3:3], which can be precisely summarized only to [1:5,3:3] ∪ [2:5,1:1]. If t12 were summarized directly to [1:5,1:3], there would be a dependence between t12 and t2, which is a false dependence. Hence, the delayed union operation provides superior accuracy in this case. Figure 11 shows the task graph of the program example in Figure 10, together with the access pattern of each task. The dark shade represents read and write accesses, while the lighter shade represents read-only accesses to array A.

Program LU
  integer n
  parameter (n=5)
  double precision a(n,n)
!$OMP PARALLEL
  do k=1,n
!$OMP SINGLE
    do m=k+1, n
      a(m,k) = a(m,k) / a(k,k)
    enddo
!$OMP END SINGLE
!$OMP DO PRIVATE(i,j)
    do j=k+1,n
      do i=k+1,n
        a(i,j) = a(i,j) - a(i,k)*a(k,j)
      enddo
    enddo
!$OMP ENDDO
  enddo
!$OMP END PARALLEL
end

Figure 10 Simple OpenMP LU code example

Figure 11 Task graph and memory access pattern
In the OpenMP static scheduling shown in Figure 12, we assume that the number of threads is four and that data is placed by the first-touch policy. The data accessed by a processor will then not be the same as that accessed in the next iteration of the k loop. Consider the OpenMP schedule in Figure 12: after t1 and t11 have been scheduled to processor p1, these tasks are the first to touch columns 1 and 2, respectively. Task t12 on processor p2 first touches column 3 of array A, p3 first touches column 4, and p4 first touches column 5. There is no problem with t2, since it reuses column 2. But t22 and t23 cannot reuse the data that has been brought into local memory: on p2, t22 reads column 2 and reads and updates column 4, whereas the earlier task t12 first touched column 3 on that processor. Hence there is no reuse, and the data must be accessed from remote memory.
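The mismatch described above can be checked with a small sketch. The thread and chunk numbering below follow our reading of the example and are assumptions: four threads, static schedule with chunk size 1, columns 2 to 5 of the k = 1 DO construct handed to threads 0 to 3 in order, and the SINGLE task t1 running on thread 0.

```python
# Which thread first-touches each column of A, and which columns a later
# task actually accesses (n = 5, 4 threads, LU example; our assumptions).
n, nthreads = 5, 4

first_touch = {1: 0}                              # column -> owning thread; t1 touches column 1
for c, j in enumerate(range(2, n + 1)):           # t11..t14 touch columns 2..5
    first_touch[j] = c % nthreads

def task_columns(k, c):
    # Task t_kc (chunk c of the k-th DO construct) reads column k and
    # reads/writes column k + c.
    return {k, k + c}

cols = task_columns(2, 2)                         # t22, which runs on thread 1
local = {j for j, t in first_touch.items() if t == 1}
print(sorted(cols), sorted(local), cols & local)  # [2, 4] [3] set()
```

The empty intersection confirms the observation in the text: none of the columns t22 touches are local to the thread that executes it under first-touch placement.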
In addition, there are synchronization overheads in OpenMP; for instance, t2 needs to wait for all of t11, t12, t13, and t14 to finish, so t2 can only be executed after t14. We remove this synchronization overhead by translating the OpenMP code to a task graph; then t2 only needs to wait for t11, not for t12, t13, and t14. Our mapping is scheduled at runtime by the SMARTS runtime system, and the result is shown in Figure 13. It can easily be observed that our approach schedules tasks for data reuse, giving better locality.
4.2 ADI example exploiting macro-pipelined computation for data reuse
of many scientific applications [21][23]. Brandes and Desprez [21] integrated pipelined computation successfully within an HPF compiler called ADAPTOR and reported impressive experimental results. Pipelining has been used by Chamberlain et al. [29] to exploit parallelism in wavefront computations; they introduce extensions to array languages to support pipelined wavefront computation. Gonzalez et al. [24] propose a set of extensions to OpenMP to allow complex pipelined computations. This is done by defining directives in terms of thread groups and precedence relations among tasks originating from work-sharing constructs, as ours are. They assume that a heuristic scheduler is available to achieve this. However, they do not propose any extensions explicitly for macro-loop pipelined computation.

Figure 12 Static scheduling on OpenMP on four processors

Figure 13 Our scheduling based on data locality reuse on four processors
Figure 14 presents a Fortran ADI kernel written in OpenMP. The outer loop of the first parallel loop nest is parallelized so that each thread accesses a contiguous set of columns of the array A. The outer loop of the second nest is also parallelizable, but as a result each thread accesses a contiguous set of rows of A. The static OpenMP standard execution model is shown in Figure 15, and it can easily be seen that it has poor data locality. Even though we can interchange the second loop so that it also accesses A in columns, parallelism is then limited by the overheads of the DO loop and the lack of data locality, and the effects of this can easily be observed. Our strategy for handling this situation is to create a version in which threads access the same data and the second loop is executed in a macro-pipelined manner. The task graph created for this transformed code is shown in Figure 16. When we apply our mapping algorithm to this task graph, it realizes a macro-loop pipelined wavefront computation that enables data locality reuse.
!$OMP PARALLEL
!$OMP DO
  do j = 1, N
    do i = 2, N
      A(i,j) = A(i,j) - B(i,j)*A(i-1,j)
    end do
  end do
!$OMP END DO
!$OMP DO
  do i = 1, N
    do j = 2, N
      A(i,j) = A(i,j) - B(i,j)*A(i,j-1)
    end do
  end do
!$OMP END DO
!$OMP END PARALLEL
Figure 14 ADI Kernel code using OpenMP
Figure 15 OpenMP standard execution of Figure 14

Figure 16 Macro-pipelined computations of Figure 14
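The benefit of the macro-pipelined execution in Figure 16 can be illustrated with a small scheduling sketch of our own (the block and sweep indexing are hypothetical, not taken from the paper): task (s, b), the b-th block of sweep s, can start once block b of the previous sweep and block b-1 of the same sweep have completed.

```python
# Earliest step at which each task of a macro-pipelined wavefront can run,
# given the dependences (s-1, b) -> (s, b) and (s, b-1) -> (s, b).
def wavefront_steps(nsweeps, nblocks):
    step = {}
    for s in range(nsweeps):
        for b in range(nblocks):
            deps = []
            if s > 0:
                deps.append(step[(s - 1, b)])   # same block, previous sweep
            if b > 0:
                deps.append(step[(s, b - 1)])   # previous block, same sweep
            step[(s, b)] = max(deps, default=-1) + 1
    return step

steps = wavefront_steps(2, 4)
print(max(steps.values()) + 1)   # 5 pipeline steps vs. 8 if the sweeps serialize
```

With two sweeps over four blocks, the blocks of the second sweep overlap with those of the first, so the pipeline finishes in nsweeps + nblocks - 1 = 5 steps instead of the 8 steps of fully serialized sweeps.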
4.3 Experimental results
We compare OpenMP versions of the LU kernel and the ADI kernel with the code translated according to our scheme, using SMARTS as the runtime system. Our target platform was an Origin 2000 at the National Center for Supercomputing Applications (NCSA). All codes were compiled with SGI's MIPSpro Fortran 90 compiler under the options -64 -Ofast -IPA and run on MIPS R10000 processors at 195 MHz, with 32 Kbytes of split L1 cache and 4 Mbytes of unified L2 cache per processor, and 4 Gbytes of DRAM memory. We made use of the first-touch allocation strategy provided by SGI, where a datum is stored on the node where it is first accessed. We also set _DSM_MIGRATION (page migration) OFF.
[Figure 17 chart: execution time in seconds of the LU kernel code versus number of processors (1, 2, 4, 8, 16, 32) for the OMP MIPSpro, OMP Reuse, and SMARTS versions.]
We show results for the LU computation using a 1024 by 1024 double-precision matrix in Figure 17. The original OpenMP LU code compiled with MIPSpro is labeled OMP MIPSpro; OMP Reuse is a hand-written OpenMP LU version scheduled for reuse, also compiled with MIPSpro. Here we see that the SMARTS runtime system introduces non-trivial overheads. Its master/slave scheme controls the distribution of work to slave threads, and this coordination incurs runtime overhead. In addition, SMARTS does not accept task graph information in the form of precedence constraints between tasks; instead, it requires information on the read/write accesses tasks make to data objects, using these to (re)compute the data dependences at runtime. Therefore, it incurs significant overheads while recomputing information that is already available. We initially chose SMARTS in order to experiment with various strategies for runtime scheduling. Unfortunately, the high overheads in this case make it hard to show the benefits of our approach. In the future, we plan to experiment with other runtime systems that suit this need. At the same time, we also want to investigate and develop methods to reduce these overheads, possibly by minimizing the number of data objects passed to the runtime system while still decomposing arrays into blocks of a size that avoids false dependences.
In this experiment, A, the only array in the program, is declared as shared. In addition, we avoid false dependences by breaking the work into smaller tasks. In SMARTS terminology, each array region accessed by a task is a data object. Since the SMARTS runtime system needs to handle many small data objects as a result of our work decomposition, the initialization overheads caused by passing information to the runtime system lead to a slightly longer execution time for the translated code on small numbers of processors. Even though the overhead incurred by the runtime system is high in this case, it is amortized by the good data reuse as the problem size and the number of iterations increase.

Figure 17 LU kernel code
[Figure 18 chart: execution time in seconds of the ADI kernel code versus number of processors (1, 2, 4, 8, 16, 32) for the OMP MIPSpro V1, OMP MIPSpro V2, and SMARTS versions.]
Figure 18 illustrates our results for the ADI kernel code. There are two versions of the OpenMP code: version one parallelizes the outer loop of each loop nest as written in Figure 14, whereas version two interchanges the second loop nest and parallelizes it along the i loop. For our measurements, we executed the code with arrays of size 1024 by 1024. Our code, translated to the dataflow execution model to enable macro-pipelined computation, is labeled SMARTS. In the translated code, the number of data objects and read/write requests depends on the decomposition of arrays A and B. Since this code does not require decomposing the arrays into smaller blocks to avoid false dependences, there is no substantial overhead, even with a few processors. With an increasing number of processors, the translated code also gives better performance and scales well up to 32 processors.
5 RELATED WORK
The method provided by OpenMP for enforcing locality is to declare variables to be private to threads. One of the best ways to get high performance is to write code in the SPMD (Single Program Multiple Data) style which involves a systematic privatization of program data [19][20]. However, this may require significant reprogramming, which undermines one of the major benefits of the OpenMP API. The use of OpenMP and MPI [5] together also works, but at the price of considerable additional program complexity. In addition, the program will be more difficult to maintain. Several approaches have been proposed to help the user obtain good performance with modest programming effort, including extending OpenMP by data distribution directives and directives to link parallel loop iterations with the node on which specified data is stored [2][3], first touch memory allocation, dynamic page migration by the operating system [4], user-level page migration [4], and compiler optimization [6].
Figure 18 ADI kernel code

Dynamic page migration has been used in ccNUMA machines to improve data locality. In this scheme, the operating system keeps track of memory page counters to check whether to move a page to a remote node that frequently accesses it. However, this information cannot be associated with the semantics of the program, and if compile-time information is not passed to the operating system, the latter may move a page when it is not helpful to do so, or it may not move a page that should be migrated. User-level page migration is then required to improve this approach. The compiler or performance tools are needed to help identify the regions of the program that are the most computation-intensive so that calls to page migration routines may be inserted accordingly. Although this is likely to lead to an improvement, it still cannot precisely represent the semantics of the program without compile-time analysis of data access patterns. Programming using data distribution directives has the potential to help the compiler ensure data locality; however, it means that programmers must be aware of the pattern in which threads will access data across large program regions. Hence the programming task remains more complex than with plain OpenMP. Moreover, there is no agreed standard for specifying data distributions in OpenMP; existing ccNUMA compilers have implemented their own data distribution directives, sacrificing portability. The most successful existing strategy is also the simplest: first-touch data allocation places data so that it is local to the thread that first allocates it. Our algorithm also uses this technique, provided by SGI, HP and Sun among others, to ensure appropriate initial storage for data. Thus our work can be viewed as attempting to maintain this locality.
We translate OpenMP to tasks that may share data and map these tasks to a cc-NUMA platform. Most researchers who focus on task mapping problems have performed work to handle the mapping of processes on distributed systems [32][30]. Our tasks may be finer grained than these. Moreover, most distributed memory tasks already have a clear communication structure, which helps in the evaluation of alternatives and generally forms the basis for the solution proposed. Very little work has considered the problem of mapping shared memory tasks on distributed shared memory systems, such as cc-NUMA, where there is a lower penalty for making non-local accesses and the code does not make non-local data references explicit. Liang et al. [31] proposed a static task mapping method for handling shared memory programs on a DSM system based on a two-dimensional Hopfield neural network to map a group of related tasks to a group of workstations that provide DSM. However, the neural network method is too costly to apply to dynamic cases. They also have difficulty estimating the cost that is incurred by remote memory accesses and cache misses, since the actual location of shared data depends on the operating system. We use the first touch approach to minimize the last of these problems.
6 CONCLUSION AND FUTURE WORKS
Our goal is to search for compiler techniques that improve the performance of OpenMP codes without the necessity of manual program modification, especially on large ccNUMA systems. Difficulties include the cost of non-local data accesses, which remains significantly higher than that of local references, the high amount of global synchronization in most OpenMP codes, and false sharing of data in cache. To achieve our aims, we translate OpenMP codes to a form that may be executed in a dataflow fashion that enables large-scale out-of-order execution, and we explore the ability of the compiler to reduce synchronization overheads as well as to improve data reuse in the resulting ordered collection of tasks [16]. In this paper, we present a practical algorithm for the generation and improvement of a task graph that represents the necessary synchronization in an OpenMP code. We propose an algorithm for mapping the resulting tasks to processors for data reuse. From this, we can generate code that schedules tasks for reuse at runtime. We also demonstrate that this scheme can enable macro-wavefront pipelined computation. Our approach relies on compiler technology and a suitable runtime system; thus there is no need for user intervention.
Despite using a relatively inefficient runtime system in our experiments, the results with LU and ADI kernels show that our strategy has improved performance. However, there is more to be learned. Our mapping algorithm is sensitive to the method for weighting of nodes and edges, which must represent data transfer and cache effects; we intend to continue to experiment with these parameters. Our current effort does not take the problem of false sharing of cache data into account and this needs to be considered. Our future work also includes exploiting symbolic analysis to
improve our ability to handle more difficult cases and to lower the overheads of dynamic task scheduling.
Acknowledgement
This work has been supported by the Los Alamos Computer Science Institute under grant number LANL 03891-99-23 (DOE W-7405-ENG-36) and by the NSF via the computing resources provided at NCSA. We thank Suvas Vajracharya for introducing us to SMARTS, discussing its functionality with us and encouraging us to use the library.

REFERENCES
[1] OpenMP Architecture Review Board, Fortran 2.0 and C/C++ 1.0 Specifications. At www.openmp.org.
[2] B. Chapman, P. Mehrotra, and H. Zima, Enhancing OpenMP with Features for Locality Control. Proc. ECMWF Workshop "Towards Teracomputing - The Use of Parallel Processors in Meteorology", Reading, England, November 1998.
[3] J. Bircsak, P. Craig, R. Crowell, J. Harris, C. A. Nelson, C. D. Offner, Extending OpenMP For NUMA Machines: The Language. WOMPAT 2000, Workshop on OpenMP Applications and Tools, San Diego, July 2000.
[4] D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta and E. Ayguade, Is Data Distribution Necessary in OpenMP? Proc. of the IEEE/ACM Supercomputing 2000: High Performance Networking and Computing Conference, Dallas, Texas, November 2000.
[5] L. A. Smith and P. Kent, Development and Performance of a Mixed OpenMP/MPI Quantum Monte Carlo Code. Proceedings of the First European Workshop on OpenMP, Lund, Sweden, Sept. 1999, pages 6-9.
[6] S. Satoh, K. Kusano, and M. Sato, Compiler Optimization Techniques for OpenMP Programs. Second European Workshop on OpenMP (EWOMP 2000), Edinburgh, September 14-15, 2000.
[7] S. Vajracharya, S. Karmesin, P. Beckman, J. Crotinger, A. Malony, S. Shende, R. Oldehoeft, and S. Smith, Exploiting Temporal Locality and Parallelism through Vertical Execution. ICS '99.
[8] S. Vajracharya, P. Beckman, S. Karmesin, K. Keahey, R. Oldehoeft, and C. Rasmussen, Programming Model for Clusters of SMPs. PDPTA '99.
[9] P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3), pages 350-360, July 1991.
[10] V. Balasundaram, K. Kennedy, A Technique for Summarizing Data Access and its Use in Parallelism Enhancing Transformations. Proceedings of the ACM SIGPLAN '89 Conference on Programming Language Design and Implementation, June 1989.
[11] R. Triolet, F. Irigoin, P. Feautrier, Direct Parallelization of CALL Statements. Proceedings of the ACM SIGPLAN '86 Symposium on Compiler Construction, Palo Alto, CA, July 1986, pages 176-185.
[12] Béatrice Creusillet. IN and OUT array region analyses. In Workshop on Compilers for Parallel Computers, June 1995.
[13] B. Creusillet and F. Irigoin. Interprocedural array region analyses. In Lecture Notes in Computer Science - Languages and Compilers for Parallel Computing, pages 46-60. Springer-Verlag, August 1995.
[14] Michael Hind, Michael Burke, Paul Carini, and Sam Midkiff. An empirical study of precise interprocedural array analysis. Scientific Programming, 3(3), pages 255-271, 1994.
[15] Zhiyuan Li, Junjie Gu, and Gyungho Lee. Symbolic array dataflow analysis for array privatization and program parallelization. In Supercomputing '95, 1995.
[16] Tien-Hsiung Weng, Barbara M. Chapman. Implementing OpenMP Using Dataflow Execution Model for Data Locality and Efficient Parallel Execution. In Proceedings of the 7th Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS-7), IPDPS, April 2002, Ft. Lauderdale.
[17] P. Tang. Exact Side Effects for Interprocedural Dependence Analysis. Communications of the ACM, 35(8), pages 102-114, August 1992.
[18] William Pugh. The Omega test: a fast and practical integer programming algorithm for dependence analysis. Communications of the ACM, 35, pages 102-114, August 1992.
[19] Zhenying Liu, Barbara Chapman, Tien-hsiung Weng, Oscar Hernandez. Improving the Performance of OpenMP by Array Privatization. WOMPAT 2002, LNCS 2716, pages 244-256.
[20] Zhenying Liu, Barbara Chapman, Yi Wen, Lei Huang, Tien-hsiung Weng, Oscar Hernandez. Analyses for the Translation of OpenMP Codes into SPMD Style with Array Privatization. WOMPAT 2003, LNCS 2716, pages 26-41.
[21] T. Brandes and F. Desprez. Implementing pipelined computation and communication in an HPF compiler. In L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Euro-Par '96 Parallel Processing, number 1123 in Lecture Notes in Computer Science, pages 459-462, Lyon, France, August 1996. Springer.
[22] F. Desprez. A Library for Coarse Grain Macro-Pipelining in Distributed Memory Architectures. In Proceedings of the IFIP 10.3 Conference on Programming Environments for Massively Parallel Distributed Systems, pages 365-371, Monte Verita, Ascona, CH, April 1994.
[23] S. Hiranandani, K. Kennedy, and C.-W. Tseng. Evaluating compiler optimizations for Fortran D. Journal of Parallel and Distributed Computing, 21, pages 27-45, 1994.
[24] Gonzalez, M., Ayguade, E., Martorell, X. and Labarta, J.: Complex Pipelined Executions in OpenMP Parallel Applications. International Conference on Parallel Processing (ICPP 2001), September 2001.
[25] Mohammad R. Haghighat and Constantine D. Polychronopoulos. Symbolic analysis for parallelizing compilers. ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 18, number 4, pages 477-518, 1996.
[26] Thomas Fahringer and Bernhard Scholz. Symbolic evaluation for parallelizing compilers. In Proceedings of the 11th International Conference on Supercomputing, pages 261-268, Vienna, Austria, 1997.
[27] W. Blume and R. Eigenmann. The range test: A dependence test for symbolic, non-linear expressions. In Proceedings of Supercomputing '94. IEEE Press, November 1994.
[28] D. S. Nikolopoulos, E. Artiaga, E. Ayguadé and J. Labarta. Exploiting memory affinity in OpenMP through schedule reuse. ACM SIGARCH Computer Architecture News, vol. 29(5), pages 49-55, 2001.
[29] B. Chamberlain, C. Lewis and L. Snyder. Array Language Support for Wavefront and Pipelined Computations. In Workshop on Languages and Compilers for Parallel Computing, August 1999.
[30] T. Yang, A. Gerasoulis, PYRROS: Static task scheduling and code generation for message passing multiprocessors. In International Conference on Supercomputing, 1992.
[31] Liang, T. Y., Shieh, C. K., and Zhu, W. P., Task Mapping on Distributed Shared Memory Systems Using Hopfield Neural Network. In Communication Networks and Distributed Systems Modeling and Simulation Conference, Western Multi-Conference '97, January 1997, pages 37-43.
[32] Marco Vanneschi. Heterogeneous HPC Environments. Euro-Par '98, pages 21-34.
[33] Yang, T., Gerasoulis, A. List Scheduling with and without Communication. Parallel Computing Journal (19), pages 1321-1344, 1993.
[34] Weng, T.-H., Chapman, B., Wen, Y.: Practical Call Graph and Side Effect Analysis in One Pass. Technical Report, University of Houston, under preparation.
[35] The Open64 compiler. http://open64.sourceforge.net/
[36] Barbara Chapman, Oscar Hernandez, Lei Huang, Tien-hsiung Weng, Zhenying Liu, Laksono Adhianto, Yi Wen. Dragon: An Open64-Based Interactive Program Analysis Tool for Large Applications. To appear in the Proceedings of the 4th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT '03).
Hierarchical Interconnection Networks Based on (3, 3)-Graphs for Massively Parallel Processors
Gene Eu Jan, Yuan-Shin Hwang, Chung-Chih Luo, Deron Liang
Department of Computer Science
National Taiwan Ocean University
Keelung 20224, Taiwan
Abstract
This paper proposes several novel hierarchical interconnection networks based on the (3, 3)-graphs, namely folded (3, 3)-networks, root-folded (3, 3)-networks, recursively expanded (3, 3)-networks, and flooded (3, 3)-networks. Just like the hypercubes, CCC, Petersen networks, and Heawood networks, these hierarchical networks have the following nice properties: regular topology, high scalability, and small diameters. Due to these important properties, these hierarchical networks seem to have the potential to serve as alternatives for the future interconnection structures of multicomputer systems, especially massively parallel processors (MPPs). Furthermore, this paper will present routing and broadcasting algorithms for the proposed networks to demonstrate that these algorithms are as elegant as the algorithms for hypercubes, CCC, and Petersen- or Heawood-based networks.
Index Terms: Broadcasting Algorithm, Interconnection Networks, Routing Algorithm, (3, 3)-Graph, Folded (3, 3)-Networks, Root-Folded (3, 3)-Networks, Recursively Expanded (3, 3)-Networks, Flooded (3, 3)-Networks
Corresponding Author: Gene Eu Jan ([email protected])
1 Introduction
High-performance computing is centered around the concept of parallel processing. One major way to achieve this goal is by integrating multiple processors through an interconnection network, especially in massively parallel processors (MPPs), which can host hundreds or even thousands of processors. The performance of the entire system is then determined not only by the processors but also by the underlying interconnection network. Hence, high-performance interconnection networks are essential for MPPs.
Various high-performance interconnection networks have been extensively studied in the literature, such as meshes, hypercubes, twisted hypercubes [1, 6, 7], cube-connected cycles (CCC) [16], recursive networks [3, 4], and pyramids [18]. Among these networks, the hypercube family has been popular because it has several elegant properties: symmetry, regularity, high fault-tolerance, logarithmic degree and diameter, self-routing, and simple broadcasting schemes [8, 10]. Nevertheless, new networks are being proposed and analyzed with regard to their applicability and enhanced topological or performance properties. The most popular ones are those based on the Petersen or Heawood graphs and their derivatives [12]. These include the folded Petersen cube networks [14], root-folded Petersen networks, recursively expanded Petersen networks [11], hierarchical Heawood networks [9], and the hyper Petersen networks [5]. The major features of Petersen networks are: regular topology, high scalability, small diameter, and lower network cost than hypercubes.
This paper presents several new hierarchical interconnection networks based on the (3, 3)-graphs [15]: folded (3, 3)-networks, root-folded (3, 3)-networks, recursively expanded (3, 3)-networks, and flooded (3, 3)-networks. They all have basically the same features as the Petersen or Heawood networks, and hence are perfect candidates to serve as the interconnection structures of massively parallel processors.
In general, the following major criteria are commonly used to evaluate an interconnection network: diameter, degree, connectivity, and cost [14]. The diameter of a network is the maximum distance among all node pairs. The degree of a node is the number of links connected to it. The node connectivity (edge connectivity) of a graph is the minimum number of nodes (edges) whose removal results in a disconnected network. The product of degree and diameter is usually called the cost of the network. Consequently, any underlying interconnection network of MPPs must have the properties of small diameter, reasonable degree, and low cost. This paper will evaluate these hierarchical networks based on these properties.
The rest of the paper is organized as follows. Section 2 describes the (3, 3)-network and its addressing scheme. Section 3 extends the (3, 3)-networks to several n-dimensional hierarchical networks and presents the routing and broadcasting algorithms for these hierarchical networks. Section 4 evaluates these hierarchical networks by comparing their topological properties. Section 5 concludes this paper.
Figure 1: The (3, 3)-Graph on 20 Vertices
2 The (3, 3)-Graphs
This section recapitulates the definition and important properties of the (3, 3)-graphs [15]. Since the interconnection network of a parallel computer system based on the (3, 3)-graph is called a (3, 3)-network, it shares the same properties as the (3, 3)-graph. In addition, this section presents the routing and broadcasting algorithms for the (3, 3)-networks. Please note that the (3, 3)-networks defined in this section will be called basic (3, 3)-networks in order to differentiate them from the hierarchical networks proposed in Section 3.
2.1 Definition and properties
A graph with maximum degree D and diameter k is called a (D, k)-graph. The maximum number of nodes of a (D, k)-graph is bounded by the Moore bound [2]

    1 + D + D(D - 1) + ... + D(D - 1)^(k-1)

and by the fact that a D-regular graph with D odd must have an even number of vertices. Therefore, a (3, 3)-graph can have at most 20 vertices (the Moore bound is 1 + 3 + 6 + 12 = 22, which cannot be attained, and the next even number of vertices is 20). An example of a (3, 3)-graph on 20 vertices is depicted in Figure 1. The hierarchical networks proposed in this paper are based on this form of 20-node (3, 3)-graph. The node addressing scheme of the 20-node (3, 3)-graph is defined as follows.
Definition 1 (20-node (3, 3)-Graphs)
A 20-node (3, 3)-graph is a maximum (3, 3)-graph with a degree of 3 and a diameter of 3. It has a maximum number of vertices (i.e. 20) and can be divided into 5 subgraphs, each of which consists of 4 vertices. Consequently, every vertex can be denoted as a tuple (i, j), where i (0 <= i <= 4) represents the subgraph number and j (0 <= j <= 3) indicates the node number within the subgraph. Formally, a 20-node (3, 3)-graph is a graph G = (V, E), where V = {(i, j) | 0 <= i <= 4, 0 <= j <= 3} is the set of 20 vertices and E, a subset of V x V, is the set of edges.
Following are the properties of the 20-node (3, 3)-graphs:

1. Node 0 of every subgraph i connects to each node within the subgraph. Specifically, (i, 0) is adjacent to (i, j) for j = 1, 2, 3 and 0 <= i <= 4.

2. In addition to the link to node 0, the other two links of node 1 of every subgraph i connect to node 2 of subgraphs (i +- 1) mod 5. Specifically, (i, 1) is adjacent to ((i +- 1) mod 5, 2) for 0 <= i <= 4.

3. Similarly, node 2 of every subgraph i has links to node 0 of i and node 1 of the neighboring subgraphs of i. That is, (i, 2) is adjacent to ((i +- 1) mod 5, 1) for 0 <= i <= 4.

4. Node 3 of every subgraph i is connected to node 3 of subgraphs (i +- 2) mod 5, i.e. (i, 3) is adjacent to ((i +- 2) mod 5, 3) for 0 <= i <= 4.
2.2 Basic routing and broadcasting algorithms
Due to the symmetric topology of the (3, 3)-graphs, the routing and broadcasting algorithms for the basic (3, 3)-network can be easily developed. In order to give a more concise presentation of the routing and broadcasting algorithms, the following neighboring function is defined:

Algorithm Neighbors((i, j))
/* Find the neighbors of the node (i, j) within the graph */
begin
  /* Find the local neighbors within the same subgraph */
  if j = 0 then N_local = {(i, 1), (i, 2), (i, 3)}
  else N_local = {(i, 0)}
  /* Find the neighbors outside the subgraph */
  if j = 0 then N_remote = {}
  if j = 1 then N_remote = {((i - 1) mod 5, 2), ((i + 1) mod 5, 2)}
  if j = 2 then N_remote = {((i - 1) mod 5, 1), ((i + 1) mod 5, 1)}
  if j = 3 then N_remote = {((i - 2) mod 5, 3), ((i + 2) mod 5, 3)}
  /* Return all the neighbors of node (i, j) */
  return N_remote U N_local
end Algorithm Neighbors
Based on this function, the routing algorithm for the basic (3, 3)-network can be presented as follows.

Algorithm Basic-Routing(M, u, v)
/* Route the message M from node u to node v */
begin
  while u != v do
    if v is in Neighbors(u) then
      forward M to v; set u = v
    else if there exists w in Neighbors(u) and Neighbors(v) then
      forward M to w; set u = w
    else begin
      for each element w in Neighbors(u)
        if the intersection of Neighbors(w) and Neighbors(v) is nonempty then
          forward M to w; set u = w
      end for
    end if
  end do
end Algorithm Basic-Routing
Since the longest distance between any two nodes of the basic (3, 3)-network is 3, the algorithm Basic-Routing takes at most 3 steps to route a message M from the source node u to the destination node v.

Similarly, the broadcasting algorithm for the basic (3, 3)-network can be easily designed based on the topological properties of the (3, 3)-graph. Broadcasting a message on a basic (3, 3)-network is basically equivalent to sending a message from the root node (i.e. the source node of the message) of a minimum spanning tree of the basic (3, 3)-network to all other nodes. Following is the broadcasting algorithm for the basic (3, 3)-network.
Algorithm Basic-Broadcasting(M, s)
/* Broadcast the message M from the source node s */
begin
  1: s copies M to all of its neighboring nodes w in Neighbors(s)
  2: All the nodes w in Neighbors(s) copy M to the neighboring nodes x in Neighbors(w) - Neighbors(s) - {s}
  3: All the nodes x reached in step 2 copy M to the remaining neighboring nodes in Neighbors(x) - Neighbors(w) - Neighbors(s)
end Algorithm Basic-Broadcasting
Figure 2: Broadcasting from Node (0, 0) (the spanning tree shown level by level)
Level 0 (root): (0,0)
Level 1: (0,1), (0,2), (0,3)
Level 2: (1,2), (4,2), (1,1), (4,1), (2,3), (3,3)
Level 3: (1,0), (2,1), (4,0), (3,1), (2,2), (3,2), (2,0), (4,3), (3,0), (1,3)
Figure 2 shows an example of the process of broadcasting a message from node (0, 0). It is easy to show that both the Basic-Routing and Basic-Broadcasting algorithms have a constant time complexity O(1).
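The level-by-level spread of Figure 2 can be reproduced with a breadth-first sweep in which every informed node forwards the message to all of its still-uninformed neighbors, which is how Basic-Broadcasting proceeds. A small Python check (our own sketch, reusing the adjacency of Section 2.1) confirms the level sizes 3, 6, and 10 seen in the figure:

```python
def neighbors(i, j):
    """Adjacency of the 20-node (3, 3)-graph (Section 2.1)."""
    if j == 0:
        return [(i, 1), (i, 2), (i, 3)]
    if j == 1:
        return [(i, 0), ((i - 1) % 5, 2), ((i + 1) % 5, 2)]
    if j == 2:
        return [(i, 0), ((i - 1) % 5, 1), ((i + 1) % 5, 1)]
    return [(i, 0), ((i - 2) % 5, 3), ((i + 2) % 5, 3)]

def broadcast_levels(src):
    """Nodes newly informed at each step of the flooding broadcast."""
    informed, frontier, levels = {src}, [src], []
    while frontier:
        nxt = []
        for u in frontier:
            for v in neighbors(*u):
                if v not in informed:
                    informed.add(v)
                    nxt.append(v)
        if nxt:
            levels.append(sorted(nxt))
        frontier = nxt
    return levels

levels = broadcast_levels((0, 0))
assert [len(level) for level in levels] == [3, 6, 10]   # 19 other nodes in 3 steps
assert levels[0] == [(0, 1), (0, 2), (0, 3)]            # step 1, as in Figure 2
```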
3 Hierarchical Networks Based on (3, 3)-Graphs
This section presents several hierarchical extensions of the basic (3, 3)-networks: folded (3, 3)-networks, root-folded (3, 3)-networks, recursively expanded (3, 3)-networks, and flooded (3, 3)-networks.
3.1 Folded (3, 3)-Networks
To generalize the (3, 3)-network into n dimensions, several schemes are possible. Among these, the following one, called the folded (3, 3)-network, is the most prominent because it has the node-symmetric and edge-symmetric properties. It is defined based on the same concept used in the definition of the folded Petersen network [13, 14]. This section defines the folded (3, 3)-networks and presents the routing and broadcasting algorithms.
3.1.1 Definition and Properties
The formal definition of the folded (3, 3)-network is given as follows.

Definition 2 (Folded (3, 3)-Network)
An n-dimensional folded (3, 3)-network FG(n) is defined as FG(n) = (V_n, E_n), where V_n = {(x_1, x_2, ..., x_n) | x_k in V} and E_n contains an edge between (x_1, ..., x_n) and (y_1, ..., y_n) if and only if the two addresses differ in exactly one dimension k and (x_k, y_k) is an edge of the 20-node (3, 3)-graph G = (V, E). In other words, FG(n) is the n-fold graph product of the basic (3, 3)-graph.
Figure 3: An Example of a Two-Dimensional Folded (3, 3)-Network.
A two-dimensional folded (3, 3)-network FG(2) is shown in Figure 3. There are 20 edges between any two neighboring subnetworks. However, the figure only shows the detailed connections between one pair of neighboring subnetworks FG_i(1) and FG_j(1), where FG_i(n-1) denotes the i-th subnetwork of an n-dimensional folded (3, 3)-network FG(n). The detailed connections between the other subnetworks are left out for brevity.
3.1.2 Routing and Broadcasting Algorithms
The routing and broadcasting algorithms for the basic (3, 3)-networks shown in the previous section can be extended to route and broadcast messages on the folded (3, 3)-networks.
Algorithm FG(n)-Routing(M, u, v)
/* The addresses of the source and destination nodes u and v are represented as (u_1, u_2, ..., u_n) and (v_1, v_2, ..., v_n), respectively. */
begin
  for k = 1 to n step 1 do
    FG(n)-Basic-Routing(M, u, v, k)
  endfor
end Algorithm FG(n)-Routing

The FG(n)-Basic-Routing(M, u, v, k) algorithm is modified from the Basic-Routing(M, u, v) algorithm of the previous section and is shown as follows.
Algorithm FG(n)-Basic-Routing(M, u, v, k)
/* Route M within dimension k until the k-th address digits of u and v agree */
begin
  while u_k != v_k do
    if v_k is in Neighbors(u_k) then
      forward M to (u_1, ..., v_k, ..., u_n); set u_k = v_k
    else if there exists w in Neighbors(u_k) and Neighbors(v_k) then
      forward M to (u_1, ..., w, ..., u_n); set u_k = w
    else begin
      for each element w in Neighbors(u_k)
        if the intersection of Neighbors(w) and Neighbors(v_k) is nonempty then
          forward M to (u_1, ..., w, ..., u_n); set u_k = w
      end for
    end if
  end do
end Algorithm FG(n)-Basic-Routing
It is easy to see that the above algorithm has a time complexity of O(n), where n is the dimension of the network.
Similarly, the broadcasting algorithm for the folded (3, 3)-networks can be extended from the Basic-Broadcasting algorithm for the basic (3, 3)-networks and is shown as follows.
Algorithm FG(n)-Broadcasting(M, s)
begin
  for k = 1 to n step 1 do
    Basic-Broadcasting(M, s) along dimension k
  endfor
end Algorithm FG(n)-Broadcasting
Since the Basic-Broadcasting(M, s) algorithm requires constant time to execute, the time complexity of the above algorithm is again O(n).
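Since the folded network is, in effect, the n-fold graph product of the basic network (each address digit can be changed independently along its own dimension), the dimension-by-dimension routing above never needs more than 3 hops per dimension. The Python sketch below (our own construction, not code from the paper) builds the two-dimensional folded network and confirms 400 nodes, degree 6, and diameter 6 = 3n:

```python
from collections import deque
from itertools import product

def neighbors(i, j):
    """Adjacency of the 20-node (3, 3)-graph (Section 2.1)."""
    if j == 0:
        return [(i, 1), (i, 2), (i, 3)]
    if j == 1:
        return [(i, 0), ((i - 1) % 5, 2), ((i + 1) % 5, 2)]
    if j == 2:
        return [(i, 0), ((i - 1) % 5, 1), ((i + 1) % 5, 1)]
    return [(i, 0), ((i - 2) % 5, 3), ((i + 2) % 5, 3)]

def folded_neighbors(addr):
    """Neighbors in the folded (3, 3)-network: change exactly one digit of the
    address to a neighboring node of the basic graph."""
    out = []
    for d, x in enumerate(addr):
        for y in neighbors(*x):
            out.append(addr[:d] + (y,) + addr[d + 1:])
    return out

base = list(product(range(5), range(4)))
nodes = [(a, b) for a in base for b in base]          # FG(2): 20 x 20 addresses

def eccentricity(src):
    depth = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in folded_neighbors(u):
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    assert len(depth) == 400                          # connected
    return max(depth.values())

assert len(nodes) == 400                              # 20^2 nodes
assert all(len(folded_neighbors(u)) == 6 for u in nodes)  # degree 3n = 6
assert max(eccentricity(u) for u in nodes) == 6       # diameter 3n = 6
```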
3.2 Root-Folded (3, 3)-Networks
Since every subnetwork, say FG_i(n-1), of an n-dimensional folded (3, 3)-network FG(n) is connected to each of its neighboring subnetworks, say FG_j(n-1), with 20 edges, the cost of the connections will become high as n grows larger. This section presents a new class of hierarchical (3, 3)-networks, called root-folded (3, 3)-networks, as a remedy. The main difference is that any two neighboring subnetworks are now connected by a single edge.
Figure 4: A Two-Dimensional Root-Folded (3, 3)-Network
3.2.1 Definition and Properties
Definition 3 (Root-Folded (3, 3)-Networks)
An n-dimensional root-folded (3, 3)-network RFG(n) is defined as RFG(n) = (V_n, E_n), where V_n = {(x_1, x_2, ..., x_n) | x_k in V} is the same node set as in the folded network, and any two neighboring subnetworks RFG_i(n-1) and RFG_j(n-1) are connected by a single edge between their root nodes.
3.2.2 Routing and Broadcasting Algorithms
The routing and broadcasting algorithms are listed as follows:

Algorithm RFG(n)-Routing(M, u, v)
begin
  while u != v do
    if u = v then break /* u and v are the same node */
    else begin
      for k = m downto 1 do Basic-Routing(M, u, r_k) /* ascend through the root nodes */
      for k = 1 to m do Basic-Routing(M, r_k, v) /* descend toward v */
    end
  endwhile
end Algorithm RFG(n)-Routing

where m is the highest dimension in which the addresses of u and v differ and r_k denotes the root node at level k. The time complexity of the above routing algorithm for n-dimensional root-folded (3, 3)-networks is O(n).

Algorithm RFG(n)-Broadcasting(M, s)
begin
  for k = 1 to n do Basic-Routing(M, s, r_k)
  for k = 1 to n do Basic-Broadcasting(M, r_k)
end Algorithm RFG(n)-Broadcasting

Since the Basic-Broadcasting algorithm takes a constant time to execute, the time complexity of the above algorithm is again O(n).
3.3 Recursively Expanded (3, 3)-Networks
The recursive expansion (RE) method has been applied to the Petersen networks [17]. The same method can be applied to the (3, 3)-networks as well, and the new hierarchical networks will be called recursively expanded (3, 3)-networks (RE (3, 3)-networks). Each n-dimensional RE (3, 3)-network can have up to 20^n nodes. Furthermore, the maximum degree of each node is 6. Consequently, RE (3, 3)-networks can avoid the bottlenecks caused by the root nodes of root-folded (3, 3)-networks.
3.3.1 Definition and Properties
Let the basic (3, 3)-network be RE(1). An n-dimensional RE (3, 3)-network RE(n) can be defined by the following recursive expansion method:

Definition 4 (Recursively Expanded (3, 3)-Networks)
An n-dimensional network RE(n) is formed by connecting the nodes with the address (0, 0) of the 20 subnetworks RE(n-1) into a (3, 3)-network, where n >= 2.

Consequently, every connecting node (the node with the address (0, 0) of its subnetwork) of an n-dimensional RE (3, 3)-network has 6 edges, 3 of which are edges of the newly formed (3, 3)-network at the top level and the other 3 of which are connections within its own subnetwork. On the other hand, all the nodes that never serve as connecting nodes only have 3 edges. Figure 5 shows a 2-dimensional RE (3, 3)-network.
Figure 5: A Two-Dimensional RE (3, 3)-Network

3.3.2 Routing and Broadcasting Algorithms
The routing algorithm for RE (3, 3)-networks can be adapted from the routing algorithm of the basic (3, 3)-networks.

Algorithm RE(n)-Routing(M, u, v)
/* The addresses of the source and destination nodes u and v are represented as (u_1, ..., u_n) and (v_1, ..., v_n), respectively. */
begin
  for k = n downto 1 do
    RE(n)-Basic-Routing(M, u, v, k)
  endfor
end Algorithm RE(n)-Routing

Algorithm RE(n)-Basic-Routing(M, u, v, k)
begin
  while u_k != v_k do
    Basic-Routing(M, u, g_k) [ g_k: the node at the lower level with links to the nodes of the level-k (3, 3)-network ]
    if v_k is in Neighbors(u_k) then forward M to v_k; set u_k = v_k
    else if there exists w in Neighbors(u_k) and Neighbors(v_k) then
      forward M to w; set u_k = w
    else begin
      for each element w in Neighbors(u_k)
        if the intersection of Neighbors(w) and Neighbors(v_k) is nonempty then
          forward M to w; set u_k = w
      end for
    end if
  end do
end Algorithm RE(n)-Basic-Routing
The time complexity of the above routing algorithm for n-dimensional RE (3, 3)-networks is O(n).
Similarly, the broadcasting algorithm for an n-dimensional RE (3, 3)-network can be obtained by extending the broadcasting algorithm for the basic (3, 3)-networks.

Algorithm RE(n)-Broadcasting(M, s)
begin
  RE(n)-Basic-Broadcasting-All(M, s)
  parallel do
  begin
    RE(n-1)-Broadcasting(M, the first neighbor of s)
    RE(n-1)-Broadcasting(M, the second neighbor of s)
    ...
    RE(n-1)-Broadcasting(M, the last neighbor of s)
  end parallel do
end Algorithm RE(n)-Broadcasting

Algorithm RE(n)-Basic-Broadcasting-All(M, s)
begin
  if n = 0 then stop
  else begin
    for each neighboring node w of s whose flag is not equal to the address of s do
    begin
      M is forwarded from s to w
      The flag of w is set to the address of s
    end for
  end
end Algorithm RE(n)-Basic-Broadcasting-All

The time complexity of the above algorithm is O(n).
Figure 6: A Three-Dimensional Flooded (3, 3)-Network FL(3)

3.4 Flooded (3, 3)-Networks
A new form of hierarchical (3, 3)-networks, called flooded (3, 3)-networks, will now be presented. It is recursively expanded similarly to the way RE (3, 3)-networks are expanded from basic (3, 3)-networks.
3.4.1 Definition and Properties
Definition 5 (Flooded (3, 3)-Networks)
Let FL(1) be the basic (3, 3)-network. Each node of FL(1) is connected with 19 new nodes to form a basic (3, 3)-network, and the resulting network will be an FL(2) network. In turn, each of the new nodes of FL(2) will be connected to 19 new nodes to form a basic (3, 3)-network, and the resulting whole network will be an FL(3) network, and so forth.

Figure 6 depicts a 3-dimensional flooded (3, 3)-network FL(3). When n = 1, FL(1) is exactly a 1-dimensional RE (3, 3)-network RE(1). However, flooded (3, 3)-networks can be expanded to infinite dimensions.

An n-dimensional flooded (3, 3)-network FL(n) can accommodate 20 + 20*19 + ... + 20*19^(n-1) = 20(19^n - 1)/18 nodes in total. Every leaf node of FL(n) (i.e. a node at dimension n) is connected to 3 neighboring nodes, while the internal nodes have 6 neighbors each. Therefore, the average degree of an n-dimensional flooded (3, 3)-network approaches 60/19, about 3.15, as n grows.
3.4.2 Routing and Broadcasting Algorithms
Algorithm FL(n)-Routing(M, u, v)
/* The addresses of the source and destination nodes u and v are represented as (u_1, ..., u_n) and (v_1, ..., v_n), respectively. */
begin
  for k = n downto m do FL(n)-Basic-Routing(M, u, v, k)
  where m is the lowest level shared by u and v
  for k = m to n do FL(n)-Basic-Routing(M, u, v, k)
end Algorithm FL(n)-Routing

Algorithm FL(n)-Basic-Routing(M, u, v, k)
begin
  while u_k != v_k do
    Basic-Routing(M, u, p_k) [ p_k: the port node from level k to the next level ]
    if v_k is in Neighbors(u_k) then forward M to v_k; set u_k = v_k
    else if there exists w in Neighbors(u_k) and Neighbors(v_k) then
      forward M to w; set u_k = w
    else begin
      for each element w in Neighbors(u_k)
        if the intersection of Neighbors(w) and Neighbors(v_k) is nonempty then
          forward M to w; set u_k = w
      end for
    end if
  end do
end Algorithm FL(n)-Basic-Routing

Algorithm FL(n)-Broadcasting(M, s)
begin
  for k = n downto 1 do Basic-Routing(M, s, the port node at level k)
  for k = 1 to n do Basic-Broadcasting(M) within each level-k subnetwork
end Algorithm FL(n)-Broadcasting
Both FL(n)-Routing and FL(n)-Broadcasting have a time complexity of O(n).
4 Evaluation
This paper uses the following topological properties to compare the hierarchical interconnection networks presented in the previous section with some popular hierarchical networks: diameter, degree, connectivity, and cost. Figure 7 lists the topological properties of various hierarchical networks. The maximum numbers of links on nodes of the Petersen-based, Heawood-based, and (3, 3)-based hierarchical networks that are expanded by the same method are the same, but the (3, 3)-based hierarchical networks can accommodate more nodes, and hence more processors, for multiprocessor computers. The costs of the hypercube, folded Petersen, and folded (3, 3)-networks grow quadratically with the dimension, which makes them unsuitable to serve as the underlying interconnection networks of large multiprocessor systems, while the remaining hierarchical networks in Figure 7 seem to be good candidates. In addition, the flooded (3, 3)-network is developed to be suitable for integrated-circuit implementation because of its superior topological properties over the other networks: low cost, short diameter, regular topology, high scalability, and a small number of crossing edges. It is worth noting that the topological structure of flooded (3, 3)-networks can be applied to the hierarchical Petersen networks to obtain lower cost, shorter diameter, and an even smaller number of crossing edges. Without loss of generality, it is assumed that the numbers of nodes in these hierarchical (3, 3)-networks are powers of 20. The next larger power is considered if the network size falls between two powers of 20, without loss of these properties.
5 Conclusion
This paper presented several hierarchical interconnection networks, derived from the (3, 3)-graph, for high-performance multicomputer systems. All the (3, 3)-based hierarchical networks in the paper have the following nice properties: regular topology, high scalability, and small diameter. Furthermore, this paper also demonstrated that the routing and broadcasting algorithms for these hierarchical networks are as elegant as those for the Petersen-based networks and hypercubes.
References
[1] S. Abraham and K. Padmanabhan, “An analysis of the twisted cube topology,” International Conference on Parallel Processing, Vol. 1, pages 116-120, 1989.
[2] B. Bollobas, Extremal Graph Theory, Academic Press, London, 1978.
[3] S.K. Das, “Designing recursive networks combinatorially,” International Conference on Graph Theory, Combinatorics, and Computing, Baton Rouge, LA, Feb. 1991.
Figure 7: Topological Properties of Various Hierarchical Networks. The table compares the number of nodes, degree, diameter, and cost of the hypercube, CCC, the folded, root-folded, recursively expanded, and flooded Petersen and Heawood networks, and the folded, root-folded, recursively expanded, and flooded (3, 3)-graph networks. Among the surviving entries, the CCC has degree 3, the root-folded networks have degree 4 or 3, the recursively expanded networks have maximum degree 6, and the flooded (3, 3)-graph network has an average degree of 3.15.
[4] S.K. Das and A. Mao, “Embeddings in recursive combinatorial networks,” Workshop on Graph-Theoretic Concepts in Computer Science, Wiesbaden, Germany, June 18-20, 1992.
[5] S. K. Das, S. Ohring, and A. K. Banerjee, “Embeddings into hyper Petersen networks: Yetanother hypercube-like interconnection topology,” Journal of VLSI, special issue on intercon-nection networks, vol. 2, no. 4, pp. 335-351, 1995.
[6] A.H. Esfahanian, L.M. Ni, and B.E. Sagan, “On enhancing hypercube multiprocessors,”International Conference on Parallel Processing, pages 86-89, 1988
[7] P.A.J. Hilbers, M.R.J. Koopman, and J.L.A. van de Snepscheut, “The twisted cube,” Proc. Parallel Architectures and Languages Europe, pages 152-159, June 1987.
[8] Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability,New York: McGraw-Hill, 1993.
[9] Gene Eu Jan, Yuan-Shin Hwang, Ming-Bo Lin, and Deron Liang, “Novel hierarchical interconnection networks for high-performance multicomputer systems,” to appear in Journal of Information Science and Engineering (under minor revision).
[10] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Redwood City, California: The Benjamin/Cummings Publishing Company, Inc., 1994.
[11] Ming-Bo Lin and Gene Eu Jan, “Routing and broadcasting algorithms for the root-folded Petersen networks,” Journal of Marine Science and Technology, Vol. 6, No. 1, pages 65-70, 1998.
[12] James A. McHugh, Algorithmic Graph Theory, Englewood Cliffs, New Jersey: Prentice-Hall,1990.
[13] Sabine Ohring and Sajal K. Das, “The folded Petersen network: A new communication-efficient multiprocessor topology,” 1993 International Conference on Parallel Processing,Vol. I, pp. 311-314, Aug. 1993.
[14] Sabine Ohring and Sajal K. Das, “Folded Petersen cube networks: new competitors for thehypercubes,” IEEE Trans. on Parallel and Distributed Systems, Vol. 7, No. 2, pp. 151-168,February 1996.
[15] Robert W. Pratt, “The Complete Catalog of 3-Regular Diameter-3 Planar Graphs,” Depart-ment of Mathematics, University of North Carolina at Chapel Hill, 1996.
[16] F. Preparata and J. Vuillemin, “The cube-connected cycles: a versatile network for parallelcomputation,” Communications of the ACM, Vol.24, No.5, pages 300-309, 1981.
[17] H. Shen, “A high performance interconnection network for multiprocessor,” Parallel Com-puting, pages 993-1001, Sep. 1992.
[18] L. Uhr, Multicomputer architecture for artificial intelligence, Wiley Interscience, New York,1987.
Session 6: Distributed Computing
HIGH SYSTEM AVAILABILITY USING NEIGHBOR REPLICATION ON GRID
M. MAT DERIS, A. NORAZIAH, M.Y. SAMAN, A. NORAIDA, Y.W. YUAN*

University College Terengganu, Faculty of Science and Technology, 21030 Mengabang Telipot, Kuala Terengganu, Malaysia.

E-mail: [email protected] [email protected] [email protected]

*Department of Computer Science and Technology, Zhuzhou Institute of Technology, Hunan 412008, China. E-mail: [email protected]
Abstract
Data replication can be used to improve the system availability of distributed systems. In such a system, a mechanism is required to maintain the consistency of the replicated data. The grid structure technique based on quorums is one of the solutions that performs this task while providing high system availability. However, previous studies have shown that it still requires a relatively large number of copies to construct a quorum, so it is not suitable for large systems. In this paper, we propose a technique called neighbor replication on grid (NRG), in which only the neighbors of a site hold replicated data. In comparison to the grid structure technique, NRG requires a lower communication cost for an operation while providing higher system availability, which is preferred for large systems.
Keywords : replicated data, binary grid assignment, storage capacity, communication cost, system availability.
1. INTRODUCTION
Replication is a useful technique for distributed systems, where it increases the data
availability and accessibility to users despite site and communication failures [1,2,3]. It is
an important mechanism because it enables organizations to provide users with access to
current data where and when they need it. Availability refers to the probability that the
system is operational according to its specification at a given point in time. System
availability refers to the availability of read and/or write operations of the system.
To maintain the consistency and integrity of data in distributed database systems,
expensive synchronization mechanisms are needed. One of the simplest techniques for
managing replicated data is read-one write-all (ROWA) technique [3]. Read operations on
a data object are allowed to read any copy, and write operations are required to write all
copies of the data object. This technique has been proposed for managing data in mobile
and peer-to-peer (P2P) environments [2]. However, it provides read operations with a high

degree of availability at low cost but severely restricts the availability of write operations
since they cannot be executed at the failure of any copy. This technique results in the
imbalance of availability and also in the communication cost of read and write operations.
The read operations have a high availability and low communication cost whereas write
operations have a low availability with higher communication cost. Voting techniques [7]
became popular because they are flexible and are easily implemented. One weakness of
these techniques is that writing an object is fairly expensive: a write quorum of copies
must be larger than the majority of votes. To optimize the communication cost and the
availability of both read and write operations, the quorum technique generalizes the
ROWA technique by imposing the intersection requirement between read and write
operations [5]. Write operations can be made fault-tolerant since they do not need to
access all copies of data objects. Dynamic quorum techniques have also been proposed
to further increase availability in replicated databases [8,9]. However, these approaches
do not address the issue of low-cost read operations. Another technique, called Tree
quorum technique [10,11] uses quorums that are obtained from a logical tree structure
imposed on data copies. However, this technique has some drawbacks. If more than a
majority of the copies in any level of the tree become unavailable, write operations cannot

be executed.
Recently, several researchers have proposed logical structures on the set of copies in
the database, and used logical information to create intersecting quorums. A technique

that uses a logical structure, such as the grid structure technique, executes operations with low

communication costs while providing fault-tolerance for both read and write operations

[6]. However, this technique still requires that a relatively large number of copies be made
available to construct a quorum.
In this paper, without loss of generality, the terms node and site will be used
interchangeably. If data is replicated to all sites, the storage capacity becomes an issue,
thus an optimal number of sites to replicate the data is required while preserving acceptable

availability of the system. The focus of this paper is on modeling a technique to optimize
the communication costs and the system availability of the replicated data in the
distributed systems.
We describe the Neighbor Replication on Grid (NRG) technique, in which only the

neighbors of a site obtain a data copy. For simplicity, the neighbors are assigned vote one and
zero otherwise. This assignment provides a minimum communication cost with higher
system availability due to the minimum number of quorum size required. Moreover, it
can be viewed as an allocation of replicated copies such that a copy is allocated to a site if
and only if the vote assigned to the site is one. This approach is particularly useful for
large systems and also for P2P under the fixed network environment.
This paper is organized as follows: In Section 2, we review the tree quorum and the

grid structure techniques, which are then compared to the proposed technique. In Section

3, the model and the technique of NRG are presented. Section 4 analyzes the

performance of the proposed technique in terms of communication cost and availability,

and gives a comparison with the grid structure technique. The conclusion is presented in Section 5.
2. REVIEW OF REPLICA CONTROL TECHNIQUE
In this section, we review some of the replica control techniques namely the Tree
Quorum (TQ) technique, and Grid Structure (GS) technique, which are then compared to
the proposed NRG technique.
2.1 Tree quorum technique
Agrawal and El-Abbadi [10] proposed a logical tree structure over a network of sites
to achieve fault-tolerant distributed mutual exclusion. One advantage of this protocol is
that a read operation may access only one copy and the number of copies to be accessed
by a write operation is always less than a majority quorum, i.e., the quorum size

may increase to a maximum of n/2 + 1 sites when failures occur. Write operations in the tree
protocol must access a write quorum which is formed from the root, a majority of its
children, and a majority of their children, and so forth until the leaves of the tree are
reached. In Figure 1, an example of a ternary tree with thirteen copies is illustrated. For
read operations, a read quorum can be formed by the root or the majority of the children
of the root. If any node in the majority of the children of the root fails, it can be replaced

by the majority of its children, and so on recursively. In the best case, a

read quorum consists of only the root, {1}. If the root fails, a quorum is formed by the

majority of the copies at level 1, e.g., {2,3}, {3,4}, or {2,4}. If no majority at level 1 is

accessible and only site 4 is accessible, the majorities of their children replace nodes 2 and

3, respectively. Such quorums are {4,5,6} or {4,9,10}. If copies at level 0 and level 1 fail, a

quorum is formed by the majorities of the children of the selected majority at level 1, e.g.,

{5,7,8,10} or {9,10,11,12}. For write operations, the size of a write quorum for a given

tree is fixed, but the members can be different. Such quorums are

{1,2,3,5,6,8,10}, {1,3,4,9,10,11,13}, etc.
The desirable aspects of this protocol are that the cost of executing read operations is
comparable to the ROWA protocol while the availability of write operations is
significantly better [1]. However, this protocol has some drawbacks. As an example, if
more than a majority of the copies at any level of the tree become unavailable, write
operations cannot be executed.
Figure 1. A tree organization of 13 copies of an object: the root 1 at level 0; nodes 2, 3, and 4 at level 1; and their children 5-7, 8-10, and 11-13, respectively, at level 2.
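The recursive replacement rule described above (a read quorum is the root, or the union of read quorums of a majority of its children) can be sketched directly. The following Python fragment is our own illustration on the 13-copy ternary tree of Figure 1, not code from the paper; it reproduces the example quorums {1} and {2, 3} under increasing failures:

```python
# children of each internal node in the ternary tree of Figure 1
TREE = {1: [2, 3, 4], 2: [5, 6, 7], 3: [8, 9, 10], 4: [11, 12, 13]}

def read_quorum(node, up):
    """Form a read quorum rooted at `node` given the set `up` of live copies,
    or return None if no quorum can be formed."""
    if node in up:
        return {node}                      # a live node represents itself
    children = TREE.get(node, [])
    need = len(children) // 2 + 1          # a majority of the children
    quorum, got = set(), 0
    for child in children:
        sub = read_quorum(child, up)
        if sub is not None:
            quorum |= sub
            got += 1
            if got == need:
                return quorum
    return None                            # too many failures below this node

all_up = set(range(1, 14))
assert read_quorum(1, all_up) == {1}                  # best case: the root alone
assert read_quorum(1, all_up - {1}) == {2, 3}         # root down: level-1 majority
assert read_quorum(1, all_up - {1, 2}) == {3, 5, 6}   # node 2 replaced by {5, 6}
assert read_quorum(1, set()) is None                  # total failure: no quorum
```

The write-quorum construction (root, a majority of its children, and so on down to the leaves) follows the same recursion with the root always included.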
2.2 Grid Structure technique
Maekawa [13] has proposed a technique for a grid structure by using the notion of
finite projective planes to obtain a distributed mutual exclusion algorithm where all
quorums are of equal size. Cheung et al. [14], extended Maekawa's grid technique for
replicated data that supports read and write operations. In this technique, if there are n
sites in the system, all sites are logically organized in the form of a two-dimensional
√n x √n grid structure as shown in Figure 2.
Figure 2. A grid organization with 25 copies of a data object: copies 1-25 are arranged row by row in a 5 x 5 grid (first row 1-5, second row 6-10, and so on through 21-25).
Read operations on the data item are executed by acquiring a read quorum that consists of
a copy from each column in the grid. Write operations, on the other hand, are executed by
acquiring a write quorum that consists of all copies in one column and a copy from each
of the remaining columns. The read and write quorums of this technique are of size

O(√n); this technique is normally referred to as the sqrt(R/W) technique. In Figure 2,

copies {1,2,3,4,5} are sufficient to execute a read operation, whereas copies

{1,6,11,16,21,7,13,19,25} will be required to execute a write operation. However, it still
has a bigger number of copies for read and write quorums, thereby degrading the
communication cost and data availability. It is also vulnerable to the failure of entire
column or row in the grid [6].
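The GS quorum construction can be sketched in a few lines. The function names and the particular copies chosen (the first row as the read quorum, the first column as the fully written one) are illustrative assumptions, not prescribed by the technique:

```python
import math

def grid(n):
    """Arrange copies 1..n in a sqrt(n) x sqrt(n) grid, row by row."""
    side = math.isqrt(n)
    return [[r * side + c + 1 for c in range(side)] for r in range(side)]

def gs_read_quorum(n):
    """A read quorum: one copy from each column (here, the first row)."""
    return [col[0] for col in zip(*grid(n))]

def gs_write_quorum(n):
    """A write quorum: all copies of one column plus one copy from each
    of the remaining columns."""
    cols = list(zip(*grid(n)))
    return list(cols[0]) + [col[0] for col in cols[1:]]

print(gs_read_quorum(25))        # [1, 2, 3, 4, 5], size sqrt(n)
print(len(gs_write_quorum(25)))  # 9, i.e. 2*sqrt(n) - 1
```

Any such read quorum meets any such write quorum in the fully written column, which is the intersection the technique relies on.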
The notion of quorums has been generalized by Agrawal and Abbadi [6] in the
context of the grid structure. In general, a grid quorum has a length l and a width w, and
is represented as a pair <l,w>. A read operation is executed by accessing a read grid
quorum <l,w>, which is formed from l copies in each of w different columns. A write
operation is executed by writing a write grid quorum, which is formed from: a) l
copies in each of w columns, as well as b) any n-l+1 copies in each of n-w+1 columns.
The second component of the write quorum ensures the intersection between two
write quorums. To maintain the quorum intersection property between read and
write quorums, if the read grid quorum is of size <l,w>, then the write grid
quorum must be <n-l+1,n-w+1>.
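The intersection rests on a pigeonhole argument: the w columns of a read quorum and the n-w+1 columns of a write quorum must share a column, and within that column l and n-l+1 copies must share a member. A minimal sketch (the `grid_quorum` helper and the "opposite corner" placement are illustrative assumptions, not from the paper):

```python
def grid_quorum(rows, cols):
    """Copies at the given rows of the given columns of an n x n grid."""
    return {(r, c) for r in rows for c in cols}

# An n x n grid of copies; an <l, w> read quorum placed in the top-left
# corner, and an <n-l+1, n-w+1> write quorum placed in the bottom-right
# corner, i.e. the two quorums are as far apart as they can be.
n, l, w = 5, 2, 3
read_q  = grid_quorum(range(l), range(w))                # <l, w>
write_q = grid_quorum(range(l - 1, n), range(w - 1, n))  # <n-l+1, n-w+1>

# Even in this worst case the two quorums share exactly one copy.
print(read_q & write_q)   # {(1, 2)}
```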
3. NEIGHBOR REPLICATION ON GRID (NRG) TECHNIQUE
3.1 Model

A distributed database consists of a set of data objects stored at different sites in a
computer network. Users interact with the database by invoking transactions, which are
partially ordered sequences of atomic read and write operations. The execution of a
transaction must appear atomic: a transaction either commits or aborts [15,16].
In a replicated database, copies of a data object may be stored at several sites in the
network. Multiple copies of a data object must appear as a single logical data object to
the transactions. This is termed one-copy equivalence and is enforced by the replica
control technique. The correctness criterion for replicated databases is one-copy
serializability [17], which ensures both one-copy equivalence and the serializable
execution of transactions. To ensure one-copy serializability, a replicated data
object may be read by reading a quorum of copies, and it may be written by writing a
quorum of copies. The selection of a quorum is restricted by the quorum intersection
property to ensure one-copy equivalence: for any two operations o[x] and o'[x] on a data
object x, where at least one of them is a write, their quorums must have a non-empty
intersection. The quorum for an operation is defined as a set of copies whose number is
sufficient to execute that operation.
Briefly, a site i initiates an NRG transaction to update its data object. For all accessible
data objects, an NRG transaction attempts to acquire an NRG quorum. If an NRG transaction
obtains a write quorum with non-empty intersection, it is accepted for execution and
completion; otherwise it is rejected. For the read quorum, we assume that if two
transactions attempt to read a common data object, the read operations do not change the
values of the data object. Since read and write quorums must intersect and any two NRG
quorums must also intersect, all transaction executions are one-copy serializable.
3.2 The NRG Technique
In NRG, all sites are logically organized in the form of a two-dimensional grid
structure. For example, if an NRG consists of twenty-five sites, it will be logically
organized in the form of a 5 x 5 grid, as shown in Figure 2. Each site has a master data file.
In the remainder of this paper, we assume that replica copies are data files. A site is
either operational or failed, and the state (operational or failed) of each site is statistically
independent of the others. When a site is operational, the copy at the site is available;
otherwise it is unavailable.
Definition 3.2.1: A site X is a neighbor of site Y if X is logically located adjacent to Y.

A data file is replicated to the neighboring sites of its primary site. The number of
replicated copies, d, can be determined using Property 3.2, as described below.
Property 3.2: The number of replicated copies from each site, d ≤ 5.

Proof: Let n sites be logically organized in a two-dimensional grid
structure, labeled m(i,j), 1≤i≤√n, 1≤j≤√n. Two-way links
connect each site m(i,j) with its four neighbors, sites m(i±1,j) and m(i,j±1), as long as those
sites exist in the grid. Note that the four sites on the corners of the grid have only two adjacent
sites, and the other sites on the boundaries have only three neighbors. Thus the number of
neighbors of each site is at most 4. Since the data is replicated to the
neighbors, the number of replicated copies from each site, d, is:

d ≤ the number of neighbors + a copy at the site itself
  = 4 + 1 = 5.
For example, in Figure 2, the data from site 1 is replicated to sites 2 and 6, which are
its neighbors. Site 7 has four neighbors, namely sites 2, 6, 8, and 12. As such, site 7 has
five replicas.
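The neighbor relation of Definition 3.2.1 and the bound of Property 3.2 are easy to check programmatically. A sketch assuming row-major labelling of the sites from 1 to n (the function names are illustrative):

```python
import math

def neighbors(site, n):
    """Sites adjacent (up/down/left/right) to `site` in a sqrt(n) x sqrt(n)
    grid whose sites are labelled 1..n row by row."""
    side = math.isqrt(n)
    r, c = divmod(site - 1, side)
    adj = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [rr * side + cc + 1 for rr, cc in adj
            if 0 <= rr < side and 0 <= cc < side]

def replica_set(site, n):
    """Primary copy plus copies at the neighbors, so d <= 5 (Property 3.2)."""
    return [site] + neighbors(site, n)

print(replica_set(1, 25))   # corner site: [1, 6, 2], so d = 3
print(replica_set(7, 25))   # interior site: [7, 2, 12, 6, 8], so d = 5
```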
For simplicity, the primary site of any data file and its neighbors are each assigned
vote one, and all other sites vote zero. This vote assignment is called neighbor binary vote
assignment on grid. A neighbor binary vote assignment on grid, B, is a function such that

B(i) ∈ {0,1}, 1 ≤ i ≤ n

where B(i) is the vote assigned to site i. This assignment is treated as an allocation of
replicated copies: a vote assigned to a site results in a copy allocated at that site.
That is,

1 vote ≡ 1 copy.
Let

LB = ∑_{i=1}^{n} B(i)
where LB is the total number of votes assigned to the primary site and its neighbors; it
also equals the number of copies of the file allocated in the system. Thus, LB = d.
Let r and w denote the read quorum and the write quorum, respectively. To ensure that
read operations always get up-to-date values, r + w must be greater than the total
number of copies (votes) assigned to all sites. The following conditions are used to
ensure consistency:

1) 1 ≤ r ≤ LB, 1 ≤ w ≤ LB,
2) r + w = LB + 1.
Conditions (1) and (2) ensure that there is a nonempty intersection of copies between
every pair of read and write operations. Thus, the conditions ensure that a read operation
can access the most recently updated copy of the replicated data. Timestamps can be used
to determine which copies are most recently updated.
Let S(B) be the set of sites at which replicated copies are stored corresponding to the
assignment B. Then

S(B) = {i | B(i) = 1, 1 ≤ i ≤ n}.
Definition 3.2.2: For a quorum q, a quorum group is any subset of S(B) whose size is
greater than or equal to q. The collection of quorum groups is called the quorum set.
Let Q(B,q) be the quorum set with respect to the assignment B and quorum q; then

Q(B,q) = {G | G ⊆ S(B) and |G| ≥ q}.
For example, in Figure 2, let site 8 be the primary site of the master data file x. Its
neighbors are sites 3, 7, 9 and 13. Consider an assignment B for the data file x such that

Bx(8) = Bx(3) = Bx(7) = Bx(9) = Bx(13) = 1

and LBx = Bx(8) + Bx(3) + Bx(7) + Bx(9) + Bx(13) = 5.

Therefore, S(Bx) = {8,3,7,9,13}.
If the read quorum for data file x is r = 2 and the write quorum is w = LBx - r + 1 = 4, then the
quorum sets for read and write operations are Q(Bx,2) and Q(Bx,4), respectively, where

Q(Bx,2) = {{8,3},{8,7},{8,9},{8,13},{8,3,7},{8,3,9},{8,3,13},{8,7,9},{8,7,13},
{8,9,13},{3,7,9},{3,7,13},{3,9,13},{7,9,13},{8,3,7,9},{8,3,7,13},
{8,3,9,13},{8,7,9,13},{3,7,9,13},{8,3,7,9,13}}

and

Q(Bx,4) = {{8,3,7,9},{8,3,7,13},{8,3,9,13},{8,7,9,13},{3,7,9,13},{8,3,7,9,13}}.
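Quorum sets in the sense of Definition 3.2.2 can be enumerated mechanically. A sketch (the helper name is an assumption) that lists every subset of S(Bx) of size at least q and confirms that each read group meets each write group, which follows from r + w = LBx + 1:

```python
from itertools import combinations

def quorum_set(S, q):
    """Q(B, q): all subsets of S(B) with at least q members."""
    return [set(g) for k in range(q, len(S) + 1)
                   for g in combinations(S, k)]

S_Bx = [8, 3, 7, 9, 13]              # primary site 8 and its neighbors
read_groups  = quorum_set(S_Bx, 2)   # Q(Bx, 2)
write_groups = quorum_set(S_Bx, 4)   # Q(Bx, 4)

# Every read group meets every write group: |G| + |H| >= 6 > |S(Bx)| = 5.
assert all(g & h for g in read_groups for h in write_groups)
print(len(read_groups), len(write_groups))   # 26 6
```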
3.3 The Correctness of NRG
In this section, we will show that the NRG technique is one-copy serializable. We
start by defining sets of groups called coterie [13] and to avoid confusion we refer the
sets of copies as groups. Thus, sets of groups are sets of sets of copies.
Definition 3.3.1. Coterie. Let U be a set of groups that compose the system. A set of
groups T is a coterie under U iff
i) G ∈ T implies that G ≠ ∅ and G ⊆ U.
ii) If G, H ∈ T then G ∩ H ≠ ∅ ( intersection property)
iii) There are no G, H ∈ T such that G ⊂ H (minimality).
By the definition of a coterie and from Definition 3.2.2, Q(B,w) is a coterie, because
it satisfies all of the coterie properties.
Since read operations do not change the value of the accessed data object, a read
quorum does not need to satisfy the intersection property. To ensure that a read operation
can access the most recently updated copy of the replicated data, the two conditions in
sub-section 3.2 must be satisfied. A write quorum, on the other hand, needs to satisfy the
read-write and write-write intersection properties. The correctness criterion for replicated
databases is one-copy serializability. The next theorem provides a mechanism to check
whether NRG is correct.
Theorem 3.3.1. The NRG technique is one-copy serializable.
Proof: The theorem holds provided that the NRG technique satisfies the quorum
intersection properties, i.e., the write-write and read-write intersections. Since Q(B,w) is a
coterie, it satisfies the write-write intersection. For the read-write
intersection, it can easily be shown that for all G ∈ Q(B,r) and all H ∈ Q(B,w), G ∩ H ≠ ∅.
4. PERFORMANCE ANALYSIS AND COMPARISON
In this section, we analyze and compare the performance of the NRG technique with
the tree quorum and grid structure techniques in terms of communication cost and system availability.
4.1 Communication Costs and Availability Analysis
The communication cost of an operation is directly proportional to the size of the
quorum required to execute it. Therefore, we represent the communication cost in terms
of the quorum size. In estimating the availability, all copies are assumed to have the same
availability p. CX,Y denotes the communication cost of technique X for operation Y,
where Y is R (read) or W (write).
4.1.1 Grid Structure (GS) technique
Let n be the number of copies, organized as a grid of dimension √n x √n.
Read operations on the replicated data are executed by acquiring a read quorum that
consists of a copy from each column in the grid. Write operations are executed by
acquiring a write quorum that consists of all copies in one column and a copy from each
of the remaining columns, as discussed in Section 2. Thus, the communication cost,
CGS,R, can be represented as:

CGS,R = √n,

and the communication cost, CGS,W, can be represented as:

CGS,W = √n + (√n - 1) = 2√n - 1.
Let AX,Y be the availability of technique X for operation Y. In the case of the quorum
technique, a read quorum can be constructed as long as a copy from each column is
available. The read availability, AGS,R, is the probability that the system is able to
perform a read operation at a given point in time. For the GS technique, AGS,R, as given in [4], is:

AGS,R = [1-(1-p)^√n]^√n   (1)

On the other hand, a write quorum can be constructed as long as all copies of one column and one
copy from each of the remaining columns are available. The write availability is the
probability that the system is able to perform a write operation at a given point in time. For the GS
technique, AGS,W, as given in [4], is:

AGS,W = [1-(1-p)^√n]^√n - [1-(1-p)^√n - p^√n]^√n.   (2)
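Equations (1) and (2) are straightforward to evaluate numerically. A sketch with assumed function names, which reproduces the GS entries of Table 2:

```python
def gs_read_avail(p, n):
    """Equation (1): every column has at least one live copy."""
    s = round(n ** 0.5)
    col_alive = 1 - (1 - p) ** s          # P(some copy in a column is up)
    return col_alive ** s

def gs_write_avail(p, n):
    """Equation (2): every column has a live copy, and at least one
    column is fully live."""
    s = round(n ** 0.5)
    col_alive = 1 - (1 - p) ** s
    col_full  = p ** s                    # P(all copies in a column are up)
    return col_alive ** s - (col_alive - col_full) ** s

print(round(gs_write_avail(0.7, 36), 3))   # 0.526, as in Table 2
```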
4.1.2 Tree quorum (TQ) technique
Let h denote the height of the tree, D the degree of the copies in the tree, and M =
⌈(D+1)/2⌉ the majority of the degree. When the root is accessible, the read
quorum size is 1. If the root fails, the majority of its children replace it, so the quorum
size increases to M. Therefore, for a tree of height h, the maximum read quorum size is M^h.
Hence, the cost of a read operation, CTQ,R, ranges from 1 to M^h [6,11], i.e., 1 ≤ CTQ,R ≤ M^h.
The cost of a write operation, CTQ,W, which requires a majority at every level of the tree, can be
represented as: CTQ,W = ∑ M^i, i = 0…h.   (3)
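A small sketch of the TQ quorum sizes (the function name is an assumption), counting a majority at each level from the root (M^0 = 1) down to the leaves; it reproduces the n = 36 entries of Table 1 for a tree of height 3 and degree 2:

```python
import math

def tq_quorum_sizes(h, D):
    """Worst-case read quorum and write quorum size for a tree of height h
    in which every copy has D children; M = ceil((D + 1) / 2)."""
    M = math.ceil((D + 1) / 2)
    max_read = M ** h                             # root failed all the way down
    write    = sum(M ** i for i in range(h + 1))  # a majority at every level
    return max_read, write

print(tq_quorum_sizes(3, 2))   # (8, 15): the n = 36 entries of Table 1
```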
4.1.3 The NRG technique
Let pi denote the availability of site i. Read operations on the replicated data are
executed by acquiring a read quorum, and write operations by acquiring a
write quorum. For simplicity, we choose the read quorum equal to the write quorum.
Thus, the communication cost for read and write operations equals ⌈LBx/2⌉, that is,
CNRG,R = CNRG,W = ⌈LBx/2⌉. For example, if the primary site has four neighbors, each of
which has vote one, then CNRG,R = CNRG,W = ⌈5/2⌉ = 3.
For any assignment B and quorum q for the data file x, define ϕ(Bx,q) to be the
probability that at least q sites in S(Bx) are available. Then

ϕ(Bx,q) = Pr{at least q sites in S(Bx) are available}
        = ∑_{G∈Q(Bx,q)} [ ∏_{j∈G} p_j ] [ ∏_{j∈S(Bx)-G} (1-p_j) ]   (4)
Thus, the availabilities of read and write operations for the data file x are ϕ(Bx,r) and
ϕ(Bx,w), respectively. Let Av(Bx,r,w) denote the system availability corresponding to the
assignment Bx, read quorum r and write quorum w. If the probabilities that an arriving
operation on data file x is a read or a write are f and (1-f), respectively, then

Av(Bx,r,w) = f ϕ(Bx,r) + (1-f) ϕ(Bx,w).

Since in this paper we consider equal read and write quorums, the system availability is

Av(Bx,r,w) = ϕ(Bx,r) = ϕ(Bx,w).   (5)
Definition 4.1.3. Let Av(x) be the availability function with respect to x. Av(x) is in
closed form if 0 ≤ x ≤ 1 implies 0 ≤ Av(x) ≤ 1.

Theorem 4.1.3. The system availability under the NRG technique is in closed form.

Proof: For the case of read availability, from equation (4) and by Definition 4.1.3, as
0 ≤ pi ≤ 1, i = 1,2,…,LBx, we have 0 ≤ ϕ(Bx,r) ≤ 1. Similarly, for the case of write availability,
0 ≤ ϕ(Bx,w) ≤ 1 as 0 ≤ pi ≤ 1.
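Equation (4) can be evaluated by direct enumeration of the quorum groups. A sketch (names are assumptions) that reproduces the NRG value of Table 2 at p = 0.7:

```python
from itertools import combinations

def phi(p, sites, q):
    """Equation (4): probability that at least q of the replica sites are up,
    assuming each site i is up independently with probability p[i]."""
    total = 0.0
    for k in range(q, len(sites) + 1):
        for group in combinations(sites, k):
            prob = 1.0
            for i in sites:
                prob *= p[i] if i in group else 1 - p[i]
            total += prob
    return total

sites = [8, 3, 7, 9, 13]
p = {i: 0.7 for i in sites}        # uniform copy availability of 70%
print(round(phi(p, sites, 3), 3))  # 0.837, the NRG column of Table 2
```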
4.2 Performance Comparisons
4.2.1 Comparison of costs
The communication cost of an operation is directly proportional to the size of the
quorum required to execute the operation. Therefore, we represent the communication
cost in terms of the quorum size. Table 1 shows the read and write costs of the NRG,
TQ and logical Grid Structure (GS) techniques for different total numbers of
copies, n = 25, 36, 49, 64, and 81. The GS comparison data are derived from Agrawal and
Abbadi [2].
Table 1. Comparison of the read and write costs between GS, TQ and NRG under different numbers of copies

Number of copies    25   36   49   64   81
GS(R)                5    6    7    8    9
GS(W)                9   11   13   15   17
TQ(R)                8    8   11   16   16
TQ(W)               13   15   19   27   31
NRG(R/W)             3    3    3    3    3
From Table 1, it is apparent that NRG has the lowest cost for both read and write
operations, even as the number of copies grows, when compared with the TQ and GS
quorums. NRG needs only 3 copies for the read and write quorums in
all instances. In contrast, for TQ with a tree of height 3 on 36 copies, the read cost is 8,
while for GS the cost is 6 with 36 copies. The write cost is 19 and 13 for TQ and GS,
respectively, with 49 copies. These costs increase gradually as the number of copies increases.
4.2.2 Comparisons of system availabilities
In this section, we compare the system availability of the TQ and GS techniques,
based on equations (1), (2) and (3), with that of our NRG technique, based on
equations (4) and (5), for the cases n = 36, 49, and 64. In estimating the availability of
operations, all copies are assumed to have the same availability.
Figure 3 and Table 2 show that the NRG technique outperforms the TQ and GS
techniques. When an individual copy has availability 70%, the write availability of NRG
is approximately 84%, whereas the write availability of TQ and GS is approximately 25%
and 56%, respectively, for n = 36. Moreover, the write availability of TQ and GS decreases
as n increases.
Table 2. Comparison of the write availability between TQ, GS and NRG for p = 0.1,…,0.9

Technique     0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
NRG(R/W)     .009   .058   .163   .317   .500   .683   .837   .942   .991
GS(W),n=36   .000   .000   .002   .019   .083   .244   .526   .838   .989
GS(W),n=64   .000   .000   .000   .005   .030   .126   .378   .770   .988
TQ(W),n=36   2E-12  4E-8   1E-5   .001   .009   .063   .248   .568   .853
TQ(W),n=64   8E-25  9E-16  1E-10  4E-7   .0001  .007   .107   .489   .847
Figure 3. Comparison of the write availability between TQ, GS and NRG
Table 3. Comparison of the system availability between TQ, GS and NRG for p = 0.7 and f = 0.1,…,0.9

Technique    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
NRG         .837   .837   .837   .837   .837   .837   .837   .837   .837
GS,n=36     .573   .620   .667   .714   .761   .808   .855   .902   .948
GS,n=64     .440   .502   .564   .627   .689   .751   .813   .875   .937
TQ,n=36     .323   .399   .474   .549   .624   .699   .774   .850   .925
TQ,n=64     .100   .286   .375   .464   .554   .643   .732   .821   .911
Figure 4. Comparison of the system availability between TQ, GS and NRG for p = 0.7
The NRG technique also outperforms the TQ and GS techniques when the
probability that an arriving operation is a read, f, is at most 0.6, as shown in Figure 4 and Table 3.
This shows that the system availability is very sensitive to the read/write probability. For
example, when f = 0.5, the system availability of TQ and GS for n = 36 is approximately
62% and 76%, respectively, whereas the system availability of NRG is approximately
84% in all instances. Moreover, the system availability of TQ and GS decreases as n
increases. This is due to the decrease in write availability as n increases.
5. CONCLUSIONS
In this paper, a new technique, called neighbor replication on grid (NRG), has been
proposed to manage data replication in distributed database systems. The NRG technique
was analyzed in terms of system availability and communication cost. The analysis
showed that NRG provides a convenient approach to achieving high availability for
update-frequent workloads by imposing a neighbor binary vote assignment on the logical
grid structure of data copies. This is due to the small quorum size required. The technique
also provides an optimal neighbor binary vote assignment that requires less computation
time. In comparison with the Tree Quorum and Grid Structure techniques, NRG requires
significantly lower communication cost per operation while providing higher system
availability, which is preferable for large systems. Thus, this technique represents an
alternative design philosophy to the quorum approach.
References
1. O. Wolfson, S. Jajodia, and Y. Huang, "An Adaptive Data Replication Algorithm," ACM Transactions on Database Systems, vol. 22, no. 2 (1997), pp. 255-314.

2. Budiarto, S. Nishio, and M. Tsukamoto, "Data Management Issues in Mobile and Peer-to-Peer Environments," Data and Knowledge Engineering, Elsevier, 41 (2002), pp. 183-204.

3. K. Ranganathan, A. Iamnitchi, and I. Foster, "Improving Data Availability through Dynamic Model-Driven Replication in Large Peer-to-Peer Communities," Global and P2P Computing on Large Scale Distributed Systems Workshop, Berlin, May 2002, pp. 1-6.

4. R.H. Thomas, "A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases," ACM Transactions on Database Systems, vol. 4, no. 2 (1979), pp. 180-209.

5. D.K. Gifford, "Weighted Voting for Replicated Data," Proc. 7th Symp. on Operating System Principles, Dec. 1979, pp. 150-162.

6. D. Agrawal and A. El Abbadi, "Using Reconfiguration for Efficient Management of Replicated Data," IEEE Trans. on Knowledge and Data Engineering, vol. 8, no. 5 (1996), pp. 786-801.

7. S.M. Chung, "Enhanced Tree Quorum Algorithm for Replicated Distributed Databases," Int'l Conference on Databases, Tokyo, 1990, pp. 83-89.

8. M.T. Ozsu and P. Valduriez, Principles of Distributed Database Systems, 2nd Ed., Prentice Hall, 1999.

9. J.F. Paris and D.E. Long, "Efficient Dynamic Voting Algorithms," Proc. Fourth IEEE Int'l Conf. Data Eng., Feb. 1988, pp. 268-275.

10. D. Agrawal and A. El Abbadi, "The Tree Quorum Technique: An Efficient Approach for Managing Replicated Data," Proc. 16th Int'l Conf. on Very Large Data Bases, Aug. 1990, pp. 243-254.

11. D. Agrawal and A. El Abbadi, "The Generalized Tree Quorum Technique: An Efficient Approach for Managing Replicated Data," ACM Trans. Database Systems, vol. 17, no. 4 (1992), pp. 689-717.

12. S. Jajodia and D. Mutchler, "Dynamic Voting Algorithms for Maintaining the Consistency of a Replicated Database," ACM Trans. Database Systems, vol. 15, no. 2 (1990), pp. 230-280.

13. M. Maekawa, "A √N Algorithm for Mutual Exclusion in Decentralized Systems," ACM Trans. Computer Systems, vol. 3, no. 2 (1985), pp. 145-159.

14. S.Y. Cheung, M.H. Ammar, and M. Ahamad, "The Grid Protocol: A High Performance Scheme for Maintaining Replicated Data," IEEE Trans. Knowledge and Data Engineering, vol. 4, no. 6 (1992), pp. 582-592.

15. B. Bhargava, "Concurrency Control in Database Systems," IEEE Trans. Knowledge and Data Engineering, vol. 11, no. 1 (1999), pp. 3-16.

16. P.A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.

17. P.A. Bernstein and N. Goodman, "An Algorithm for Concurrency Control and Recovery in Replicated Distributed Databases," ACM Trans. Database Systems, vol. 9, no. 4 (1984), pp. 596-615.
MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-enabled MPI Processes
Namyoon Woo and Heon Y. Yeom
School of Computer Science and Engineering
Seoul National University
Seoul, 151-742, KOREA
nywoo,[email protected]
Taesoon Park
Department of Computer Engineering
Sejong University
Seoul, 143-747, KOREA
[email protected]
Abstract
Fault-tolerance is an essential element of distributed systems that require a reliable computation environment. In spite of extensive research over two decades, practical fault-tolerance systems have not been provided, due to the high overhead and the unwieldiness of previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and practice of fault-tolerance systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant MPICH version, we have designed task migration, dynamic process management for MPI, and message queue management. MPICH-GF requires no modification of application source code and affects MPICH communication as little as possible. The features of MPICH-GF are that it supports the direct message transfer mode and that all of the implementation has been done at the lower layer, that is, at the virtual device level. We have evaluated MPICH-GF with NPB applications on the Globus middleware.
1 Introduction
A ‘computational grid’ (shortly, a grid) is a specialized instance of a distributed system, which includes a heterogeneous collection of computers in different domains connected by networks [17, 18]. The grid has attracted considerable attention for its ability to utilize ubiquitous computational resources under a single system image, and it has been believed to benefit many computation-intensive parallel applications. Recently, Argonne National Laboratory has proposed the ‘Globus Toolkit’ as the framework of the grid, and it has been taken as the de-facto standard of grid services [16]. The main features of the Globus Toolkit are global resource allocation and management, directory service, remote task running, user authentication, and so on. Although the Globus Toolkit can monitor and manage global resources, it lacks dynamic process management such as fault-tolerance or dynamic load balancing, which is essential to distributed systems.
Distributed systems are not reliable enough to guarantee the completion of parallel processes in a determinate time because of their inherent failure factors. The system consists of a number of nodes, disks and network lines, which are all exposed to failures. Even a single local failure can be fatal to the parallel processes, since it nullifies all of the computation results which have been produced in cooperation with one another. Assuming that there is a one percent chance that a single machine might crash during the execution of a parallel application, there is only about a 37 percent chance (0.99^100) that a whole system with one hundred machines would be alive throughout the
application run. In order to increase the reliability of distributed systems, it is important to provide them with fault-tolerance.
Checkpointing and rollback-recovery is a well-known technique for fault-tolerance. Checkpointing is an operation that stores the states of processes into stable storage for the purpose of recovery or migration [12]. A process can resume its previous state at any time with the latest checkpoint file, and periodic checkpointing can minimize the computation loss incurred by a failure. Although several stand-alone checkpoint toolkits [23, 31, 38] have been implemented, they are not sufficient for parallel computing for the following reasons: First, they cannot restore communication states such as sockets or shared memory. Second, they do not consider the causal relationship among the states of processes. Process states may be dependent on one another in the message-passing environment [9, 27]. Hence, independent process recovery without consideration of dependency may tangle the global process states. The states of inter-process relations should be reconstructed for consistent recovery. Consistent recovery for message-passing processes has been extensively studied for over two decades, and several distributed algorithms for consistent recovery have been proposed. However, implementation of the algorithms is another issue, because they assume the following:
• A process detects a failure and revives by itself.
• Both revived and survived processes can communicate after recovery without any procedure of channel reconstruction.
• Checkpointed binary files are always available.
Also, there have been many efforts to make the theory practical [19, 28, 10, 6, 13, 24, 34, 22, 36, 4, 33, 30]. Their approaches take different strategies in the following respects: system-level versus application-level checkpointing, kernel-level versus user-level checkpointing, user-transparency, support of non-blocking message transfer, and direct versus indirect communication methods. According to their strategies, the implementation levels differ. However, the frameworks remain proofs of concept or evaluation tools for the recovery algorithms. To the best of our knowledge, there are only a few systems which are actually used in practice [19, 11], and they are only valid for a specific parallel programming model or a specific machine.
Our goal in this paper is to construct a practical fault-tolerance system for message-passing applications on grids. We have integrated a rollback-recovery algorithm with the Message Passing Interface (MPI) [14], the de-facto standard of message-passing programming. As a research result, we present our MPICH-GF, which is based on MPICH-G2 [15], the grid-enabled MPI implementation. MPICH-GF is completely transparent to application developers, and it requires no modification of application source code. No in-transit message during checkpointing is ever lost, whether it belongs to a blocking or a non-blocking operation. We have extended the Globus job manager module to support rollback-recovery and to control distributed processes across domains. Above all, our main implementation issue is to provide dynamic process management, which is not defined in the original MPI standard (version 1). Since the original MPI standard specifies only static process group management, a revived process, regarded as a new instance, cannot rejoin the process group. In order to enable a new MPI instance to communicate with the running processes, we have implemented an ‘MPI_Rejoin()’ function. Currently, a coordinated checkpointing protocol has been implemented and tested with MPICH-GF, and other consistent recovery algorithms (e.g. message logging) are under development. MPICH-GF operates on Linux kernel 2.4 with Globus Toolkit 2.2.
The rest of this paper is organized as follows: In Section 2, we present the concept of consistent recovery and related work on fault-tolerance systems. Section 3 describes the operation of the original MPICH-G2. We propose the MPICH-GF architecture in Section 4 and address implementation issues in Section 5. The experimental results of MPICH-GF with the NAS Parallel Benchmarks are shown in Section 6, and finally the conclusions are presented in Section 7.
2 Background
2.1 Consistent Recovery
In the message-passing environment, the states of processes may have dependency relations with one another through message-receipt events. A consistent system state is one in which, for every message-receipt event reflected in the system state, the corresponding sending event is also reflected [9]. If a process rolls back to a past state but another process whose current state is dependent on the lost state does not roll back, inconsistency occurs.
Figure 1. Inconsistent global checkpoint
Figure 1 shows a simple example of two processes whose local checkpoints do not form a consistent global checkpoint. Suppose that the two processes roll back to their latest local checkpoints. Then one process has not yet sent a message that the other has already marked as received; such a message becomes an orphan message, and it causes inconsistency. Conversely, a message that the sender has marked as already sent, but whose receiver still waits for its arrival, is considered a lost message. If an in-transit message is not recorded during checkpointing, it becomes a lost message. Both of these message types cause abnormal execution of processes during recovery.
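The orphan-message condition can be stated as a small executable check. This toy model (not from the paper) timestamps each send and receipt on the local logical clock of the respective process:

```python
def is_consistent(checkpoints, messages):
    """A global checkpoint is inconsistent if it reflects a receipt whose
    send is not reflected (an orphan message).  `checkpoints[p]` is the
    local logical time of p's checkpoint; each message is a tuple
    (sender, send_time, receiver, recv_time)."""
    for sender, sent, receiver, received in messages:
        send_reflected = sent <= checkpoints[sender]
        recv_reflected = received <= checkpoints[receiver]
        if recv_reflected and not send_reflected:
            return False   # orphan: receipt recorded, send rolled back
    return True

# A message received before P2's checkpoint but sent after P1's -> orphan
ckpts = {"P1": 5, "P2": 5}
msgs  = [("P1", 7, "P2", 4)]
print(is_consistent(ckpts, msgs))   # False
```

A lost message (send reflected, receipt not) does not make the state inconsistent by this definition, but it must be replayed during recovery, which is why in-transit messages have to be logged or flushed.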
Extensive research on consistent recovery has been conducted [12]. Approaches to consistent recovery can be categorized into coordinated checkpointing, communication-induced checkpointing, and message logging. In coordinated checkpointing, processes synchronize their local checkpointing so that a consistent set of global checkpoints can be guaranteed [9, 21]. On failure, all the processes roll back to the latest global checkpoints for consistent recovery. However, it is a dominant idea that coordination would not scale up. Communication-induced checkpointing (CIC) allows processes to checkpoint independently while preventing the domino effect using information piggy-backed on messages [2, 12, 37]. In [2], Alvisi et al. disputed the belief that CIC would be scalable. According to their experimental report, CIC generates enormous numbers of forced checkpoints, and the process autonomy in placing local checkpoints does not seem to pay off in practice. Message logging records messages along with checkpoint files in order to replay them during recovery. It can be further classified into pessimistic, optimistic, and causal message logging according to the policy on how message logs are stored [3]. Log-based rollback-recovery minimizes the amount of lost computation at the price of high storage overhead.
2.2 Related Works
Fault-tolerant systems for message-passing processes can be categorized by whether they support direct or indirect message transfer. With direct message transfer, application processes communicate with other processes directly, that is, client-to-client communication is possible. The MPICH implementation is the representative case of the direct transfer mode. With indirect message transfer, messages are sent through a medium such as a daemon: for example, PVM and LAM-MPI [7] are such cases. In these systems, processes do not have to know the physical
address of another process. Instead, they maintain a connection with the medium, and hence the recovery of the communication context is relatively easy. However, an increase in message delay is inevitable.
CoCheck [36] is a coordinated checkpointing system for PVM and tuMPI. While CoCheck for tuMPI supports the direct transfer mode, the PVM version exploits the PVM daemon to transfer messages. CoCheck exists as a thin library layered over PVM (or MPI) that wraps the original API. The process control messages are implemented at the same level as application messages. Unless an application process calls the recv() function explicitly, it cannot receive any control messages. MPICH-V [6] is a fault-tolerant MPICH version that supports pessimistic message logging. Every message is transferred to a remote Channel Memory server (CM) that logs and replays messages. CMs are assumed to be stable, so that a revived process can recover simply by reconnecting to the CMs. According to their literature, the main reasons for using CMs are to cope with volatile nodes and to keep log data safe, even though the system pays twice the cost for message delivery. FT-MPI [13], proposed by Fagg and Dongarra, supports MPI-2's dynamic task management. FT-MPI has been built on the PVM or HARNESS core library, which exploits daemon processes. Li and Tsay also proposed a LAM-MPI based implementation where messages are transferred via a multicast server on each node [22]. MPI-FT, proposed by Batchu et al. [4], adopts task redundancy to provide fault-tolerance. This system has a central coordinator that relays messages to all the redundant processes. MPI-FT from Cyprus University [24] adopts message logging. An observer process copies all the messages and reproduces them for recovery. MPI-FT pre-spawns processes at the beginning so that one of the spare processes can take over a failed process. This system has a high overhead for the storage of all the messages.
There are also some research results supporting the direct message transfer mode. Starfish [1] is a heterogeneous checkpointing toolkit based on a virtual machine, which makes it possible for processes to migrate among heterogeneous platforms. The limits of this system are that applications have to be written in OCaml and that byte code runs more slowly than native code. Egida [33] is an object-oriented toolkit that provides both communication-induced checkpointing and message logging for MPICH with the ch_p4 device. An event handler hijacks MPI operation events in order to perform the corresponding pre-defined actions for rollback-recovery. To support the atomicity of message transfer, it substitutes blocking operations for non-blocking ones. The master process with rank 0 is responsible for updating the communication channel information on recovery; therefore, the master process is not supposed to fail. The current version of Egida can detect only process failures, and a failed process is recovered on the same node where it previously ran. As a result, Egida cannot manage hardware failures. Hector [34] is similar to our MPICH-GF. Hector exists as a movable MPI library, MPI-TM, and several executables. Hierarchical process managers create and migrate application processes, and coordinated checkpointing has been implemented. Before checkpointing, every process closes its channel connection to ensure that no in-transit message is left in the network. Processes have to reconnect the channel after checkpointing.
One technique to prevent in-transit messages from being lost is to put off checkpointing until all the in-transit messages are delivered. In CoCheck, processes exchange ready messages (RMs) with one another for the purpose of coordination and to guarantee the absence of in-transit messages [36]. RMs are sent through the same channel as the application messages. Since the channels are based on TCP sockets, which satisfy the FIFO property, an RM's arrival guarantees that all the messages issued earlier than the RM have arrived at the receiver. In Legion MPI-FT [30], each process reports the number of messages sent and received to the coordinator at the checkpointing request. The coordinator calculates the number of in-transit messages and then waits until the processes report that all the in-transit messages have arrived.
The other way to prevent the loss of in-transit messages is to build a user-level reliable communication protocol in which in-transit messages are recorded in the sender's checkpoint file as undelivered. Meth et al. named this technique 'stop and discard' in [25]. Messages received during checkpointing are discarded. After checkpointing, discarded messages are re-sent by the user-level reliable communication protocol. This mechanism requires one more memory copy at the sender side. RENEW [28] is a recoverable runtime system proposed by Neves et al. It also has user-level reliable communication layers on top of the UDP transport protocol in order to log messages and to prevent the loss of in-transit messages.

Some research uses application-level checkpointing for migration among heterogeneous nodes or for minimization of the checkpoint file size [5, 29, 30, 32, 35]. This type of checkpointing burdens application developers with deciding when to checkpoint, what to store in the checkpoint file, and how to recover with the stored information. The CLIP toolkit [10] for the Intel Paragon provides a semi-transparent checkpointing environment for users. The user must perform minor code modifications to define the checkpointing locations. Although the system-level checkpoint file is not itself portable across heterogeneous platforms, we believe that user transparency is an important virtue, since it is unlikely that a process must recover on a heterogeneous node when plenty of homogeneous nodes exist. In addition, application developers are not willing to accept such a programming effort.
To sum up, most of these systems have been implemented using the indirect communication mode, or they are valid only for specialized platforms.
3 MPICH-G2
In this section, we describe the execution of MPI processes on Globus middleware and the communication mechanism of MPICH-G2. The Message Passing Interface (MPI) is the de facto standard specification for message-passing programming that abstracts low-level message-passing primitives away from the developer [14]. Among several MPI implementations, MPICH [20] is the most popular for its good performance and portability. The good portability of MPICH can be attributed to its abstraction of low-level operations, the Abstract Device Interface (ADI), as shown in Figure 2. An implementation of the ADI is called a virtual device. MPICH version 1.2.3 includes about 15 virtual devices. In particular, MPICH with the grid device globus2 is called MPICH-G2 [15]. For reference, our MPICH-GF has been implemented as the unmodified upper MPICH layer plus our own virtual device 'ft-globus', derived from globus2.
Figure 2. Layers of MPICH: collective operations (Bcast(), Barrier(), Allreduce(), ...) and point-to-point operations (Send()/Recv(), Isend()/Irecv(), Waitall(), ...) form the MPI implementation, layered over the ADI and its virtual devices (ch_shmem, ch_p4, globus2, ft-globus, ...).
3.1 Globus Run-time Module
The Globus toolkit [16], proposed by ANL, is the de facto standard grid middleware. It contains directory service, resource monitoring/allocation, data sharing, authentication, and authorization. However, it lacks dynamic run-time process control, for example dynamic load balancing or fault tolerance.
Figure 3 describes how MPI processes are launched on the Grid middleware. Three main Globus modules concern process execution: DUROC (Dynamically Updated Request Online Co-allocator), GRAM (Globus Resource Allocation Management) job managers, and a gatekeeper. DUROC distributes a user request to the local GRAM modules, and a gatekeeper on each node then checks whether the user is authenticated. If (s)he is an authenticated user, the gatekeeper launches a GRAM job manager. Finally, the GRAM job manager forks
Figure 3. The procedure of process launching in Globus: DUROC (central manager) contacts a gatekeeper on each node, which launches a GRAM job manager (local manager); the job manager fork()s the MPI application processes.
and executes the requested processes. Neither DUROC nor the GRAM job manager controls processes dynamically; they just monitor them. Nevertheless, Globus has the fundamental framework of hierarchical process management. For dynamic process management, we have extended the capabilities of DUROC and the GRAM job manager by modifying their source code. We present their extended abilities in Section 4; in this context, we use the terms 'central manager' and 'local manager' instead of DUROC and GRAM job manager, respectively, for convenience.
3.2 globus2 Virtual Device
Collective communication in MPICH-G2 is implemented as a combination of point-to-point (P2P for short) communications based on non-blocking TCP sockets and active polling. MPICH has P2P communication primitives with the following semantics:
Blocking operations: MPI_Send(), MPI_Recv()

Non-blocking operations: MPI_Isend(), MPI_Irecv()

Polling: MPI_Wait(), MPI_Waitall()
The blocking send operation submits a send request to the kernel and waits until the kernel copies the message into kernel-level memory. The blocking receive operation likewise waits until the kernel delivers the requested message to user-level memory, whereas a non-blocking operation only registers its request and returns. Actual message delivery for a non-blocking operation is not performed until a polling function is called. Indeed, in MPICH-GF a blocking operation is the combination of the corresponding non-blocking operation and the polling function.
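The relationship above can be illustrated with a minimal, MPI-free model (all names here are invented for illustration, not MPICH internals): a non-blocking call merely registers a request, the polling function performs the actual delivery, and a blocking call is simply the two combined.

```python
# Toy model of MPICH-GF's operation semantics: isend() registers a request,
# wait() performs the actual delivery, send() = isend() + wait().

class Request:
    def __init__(self, payload, mailbox):
        self.payload = payload
        self.mailbox = mailbox      # stands in for the kernel-level buffer
        self.done = False

def isend(payload, mailbox):
    """Non-blocking send: register the request and return immediately."""
    return Request(payload, mailbox)

def wait(request):
    """Polling function: perform the actual delivery of a registered request."""
    if not request.done:
        request.mailbox.append(request.payload)
        request.done = True

def send(payload, mailbox):
    """Blocking send = non-blocking send + polling, as in MPICH-GF."""
    wait(isend(payload, mailbox))

mailbox = []
r = isend("hello", mailbox)
assert mailbox == []            # nothing delivered until polled
wait(r)
send("world", mailbox)
assert mailbox == ["hello", "world"]
```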
The communication mechanism of the globus2 device is similar to that of the ch_p4 device [8]. Each MPI process opens a 'listener port' in order to accept requests for channel opening. On receipt of a request, the receiver opens another socket and constructs a channel between the two processes. All the listener information is transferred to every process at MPI initialization: the master process with rank 0 collects the listener information of the others and broadcasts it. However, globus2 does not fork a separate listener process as ch_p4 does. A channel opening request is not accepted until the receiver explicitly receives the request by calling the polling function.
Figure 4 shows the structure of the process group table commworldchannel of the globus2 device. The i-th entry in commworldchannel contains the information of the channel to the process with rank i. In Figure 4,
Figure 4. The commworldchannel structure: each channel entry (tcp_miproto_t) holds the hostname and listener port of the peer, a handlep for the open channel, and a send queue (send_q_head/send_q_tail) of tcpsendreq entries carrying the buffer, source rank, destination rank, tag, and datatype.
we abstract this structure to show only the values of our concern. The pair of hostname and listener port is the address of a listener. The handlep field contains the real channel information; if handlep is null, the channel has not been opened yet. A send operation pushes a request into the send queue and registers the request with the globus_io module. The polling function is then called to wait until the kernel handles the request.
There are two receive queues in MPICH: the unexpected queue and the posted queue (Figure 5). The former contains arrived messages whose receive requests have not been issued yet. When a receive operation is called, it first examines the unexpected queue to see whether the message has already arrived. If the corresponding message exists in the unexpected queue, it is delivered; otherwise, the receive request is enqueued into the posted queue. On message arrival, the handler checks whether there is a corresponding request in the posted queue. If one exists, the message is copied into the requested buffer; otherwise, the message is pushed into the unexpected queue.
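The matching logic just described can be sketched as follows (a simplified model with assumed field names, not the actual MPICH data structures):

```python
# Two-queue receive mechanism: arriving messages check the posted queue,
# receive calls check the unexpected queue first.

unexpected_q = []   # messages that arrived before a matching request
posted_q = []       # receive requests waiting for a matching message

def matches(request, message):
    return request["source"] == message["source"] and request["tag"] == message["tag"]

def on_message_arrival(message):
    for req in posted_q:
        if matches(req, message):
            posted_q.remove(req)
            req["buffer"].append(message["data"])   # deliver into requested buffer
            return
    unexpected_q.append(message)                    # no request yet: keep it

def recv(source, tag, buffer):
    req = {"source": source, "tag": tag, "buffer": buffer}
    for msg in unexpected_q:
        if matches(req, msg):
            unexpected_q.remove(msg)
            buffer.append(msg["data"])              # already arrived: deliver now
            return
    posted_q.append(req)                            # not yet arrived: post request

buf = []
on_message_arrival({"source": 1, "tag": 7, "data": "early"})  # goes to unexpected queue
recv(1, 7, buf)                                               # matches immediately
assert buf == ["early"]
recv(2, 9, buf)                                               # posted, nothing arrived yet
on_message_arrival({"source": 2, "tag": 9, "data": "late"})
assert buf == ["early", "late"]
```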
Figure 5. Receive queues of MPICH: the posted queue holds Rhandle entries (source rank, tag, context id, buffer, datatype) for issued receive requests, while the unexpected queue holds messages that arrived before a matching request.
4 MPICH-GF
In the MPICH-GF system, MPI processes run under the control of hierarchical managers, a central manager andlocal managers. The central manager and the local managers are responsible for hardware or network failures andprocess failures, respectively. They are also responsible for checkpointing and automatic recovery. In this section,we present MPICH-GF’s structure and checkpointing/recovery protocol implementation.
Figure 6. The structure of MPICH-GF: the ft-globus device, derived from the original globus2 device, adds dynamic process management, a checkpoint toolkit, and message queue management beneath the unmodified MPI implementation and ADI layers (collective and P2P communication).
4.1 Structure
Figure 6 describes the MPICH-GF library structure. The fault-tolerance module has been implemented at the virtual device level, in 'ft-globus', and MPICH-GF requires no modification of the upper MPI implementation layer. The ft-globus device contains dynamic process group management, a checkpoint toolkit, and message queue management. Most previous works implemented the fault-tolerance module at the upper layer, abstracting away the communication primitives, while our implementation has been performed at the lower layer. The upper-layer approaches neglect the characteristics of some specific communication operations, for example the non-blocking operations. In addition, the low-level approach is necessary to reconstruct the physical communication channel.
4.2 Coordinated Checkpointing
For consistent recovery, the coordinated checkpointing protocol is employed. The central manager initiates global checkpointing periodically, as shown in Figure 7. Local managers then signal processes with SIGUSR1 so that the processes can prepare for checkpointing. The signal handler for checkpointing is registered in the MPI process during MPI initialization. On receipt of SIGUSR1, the signal handler executes a barrier-like function before checkpointing. Performing this barrier guarantees two things. One is that there is no orphan message between any two processes. The other is that there is no in-transit message, because the barrier messages push any previously issued message into the receiver. The channel implementation of globus2 is based on TCP sockets, and hence the FIFO property holds. Pushed messages are stored in the receiver's queue in user-level memory so that the checkpoint file can include them. When processes are recovered with this global checkpoint, all in-transit messages are also restored. This technique is similar to the Ready Message of CoCheck. After the quasi-barrier, each process generates its checkpoint file and informs the local manager that checkpointing completed successfully. The central manager checks that all the checkpoint files have been generated and confirms the collection of checkpoint files as a new version of the global checkpoint.
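The coordination flow of Figure 7 can be sketched, much simplified, as follows (class and method names are invented; the real MPICH-GF managers are separate processes communicating over the network):

```python
# Toy coordinator: ask every process to checkpoint, wait for n "completed"
# replies, then confirm the set of checkpoint files as a new global version.

class Process:
    def __init__(self, rank):
        self.rank = rank
        self.checkpoints = []

    def on_sigusr1(self, epoch):
        # the barrier-like step would run here to flush in-transit messages
        self.checkpoints.append(epoch)      # write a checkpoint file for this epoch
        return "completed"

class CentralManager:
    def __init__(self, processes):
        self.processes = processes
        self.global_version = 0

    def initiate_checkpoint(self):
        epoch = self.global_version + 1
        replies = [p.on_sigusr1(epoch) for p in self.processes]   # via local managers
        if all(r == "completed" for r in replies):                # wait for n replies
            self.global_version = epoch                           # confirm new version
        return self.global_version

procs = [Process(r) for r in range(4)]
cm = CentralManager(procs)
assert cm.initiate_checkpoint() == 1
assert cm.initiate_checkpoint() == 2
assert all(p.checkpoints == [1, 2] for p in procs)
```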
4.3 Consistent Recovery
In the MPICH-GF system, the hierarchical managers are responsible for failure detection and automatic recovery. We have implemented the hierarchical managers by modifying the original Globus run-time modules as described in Section 3. Since application processes are forked by the local manager, the local manager receives SIGCHLD when
Figure 7. Coordinated checkpointing protocol: the central manager initiates checkpointing and requests it from each local manager, the local manager signals its process with SIGUSR1, each process performs the barrier and then checkpoints, and the central manager waits for n 'checkpoint completed' replies before confirming.
its forked application process terminates. Upon receiving the signal, the local manager checks whether the termination was through a normal exit() by calling the waitpid() system call. If so, the execution completed successfully and the local manager does not have to do anything. Otherwise, the local manager regards it as a process failure and notifies the central manager. However, it is possible for a local manager to fail as well. The central manager monitors all the local managers by pinging them periodically. If a local manager does not answer, the central manager assumes that the local manager, the hardware, or the network has failed. It then re-submits the request to the GRAM module in order to restore the failed processes from the checkpoint files.
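The local manager's check can be sketched with POSIX primitives (a simplified, single-child model; the names are ours):

```python
# After a child terminates, waitpid() plus the WIF* predicates distinguish a
# normal exit() from a termination that should be reported as a process failure.

import os, signal

def run_child(action):
    pid = os.fork()
    if pid == 0:                      # child: the "application process"
        action()
        os._exit(0)                   # not reached if action kills the process
    _, status = os.waitpid(pid, 0)    # the local manager reaps the child
    if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
        return "ok"                   # normal exit: nothing to do
    return "process failure"          # abnormal: notify the central manager

assert run_child(lambda: None) == "ok"
assert run_child(lambda: os.kill(os.getpid(), signal.SIGKILL)) == "process failure"
```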
In coordinated checkpointing, a single failure results in the rollback of all processes to the consistent global checkpoint. The central manager broadcasts both the failure event and a rollback request. Our first approach to recovery was to kill all the processes on a single failure and to recreate them by submitting sub-requests to the gatekeeper on each node. This approach took too much time for the gatekeeper's authentication and the reconstruction of all the channels. To improve the efficiency of recovery, we do not terminate the surviving processes. Instead, they load the checkpoint files into their memory by calling exec(). This mechanism affects only the user-level memory, without affecting the channel status remaining on the kernel side. After reloading their memory, the surviving processes can communicate without any channel reconstruction. However, the channel to the failed process must be reconstructed.
5 Implementation Issues
5.1 Communication Channel Reconstruction
The original MPI specification supports only static process group management. Once a process group and its channels are built, this information cannot be altered at run time; in other words, no new process can join the group. A recovered process instance is regarded as a new instance from the group's point of view: although it can restore the process state, it cannot communicate with the surviving processes. Our proposed solution is to invalidate the communication channels of the failed process and to reconstruct them. We have implemented a new 'MPI_Rejoin()' function for this purpose.
Before a restored process resumes computation, it calls MPI_Rejoin() in order to update its listener information in the others' commworldchannel tables and to re-initialize its own channel information. To update the listener port, the restored process reports the following values:
Figure 8. Communication channel reconstruction protocol: (1) the recovered process informs its local manager of the new listener information, (2) the local manager forwards it to the central manager, (3) the central manager broadcasts it, and (4) the survived processes update their commworldchannel information.
MPI_function():
    ...
    requested_for_coordination = FALSE;
    ft_globus_mutex = 1;
    ...                                      /* message transfer in progress */
    if (requested_for_coordination == TRUE) then
        do_checkpoint();
    endif
    ft_globus_mutex = 0;

signal_handler():
    if (ft_globus_mutex == 1) then
        requested_for_coordination = TRUE;   /* defer until transfer completes */
    else
        do_checkpointing();
    return;

Figure 9. Atomicity of message transfer
global rank: the logical process ID of the previous run.
hostname: the address of the current node where the process is restored.
port number: the new listener port number.
This information is broadcast via the hierarchical managers. Each surviving process either invalidates its channel to the failed process or renews the listener information of the restored process, according to the event information from its local manager. Figure 8 describes the interaction among managers and processes for the MPI_Rejoin() call. The restored process frees its handles and re-initializes its channels, so that it behaves as if it had not yet created any channel to the others. The procedure of channel reconstruction is then the same as that of channel creation (as described in Section 3.2).
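The effect of MPI_Rejoin() on the survivors' tables can be sketched as follows (a toy model with invented structures, not the real commworldchannel layout):

```python
# The restored process announces (rank, hostname, port); survivors record the
# new listener address and invalidate the stale handle, so the next send
# re-opens the channel exactly as an initial channel creation would.

def make_table(listeners):
    # commworldchannel-like table: rank -> {addr, handle}
    return {r: {"addr": a, "handle": f"ch{r}"} for r, a in listeners.items()}

def rejoin(tables, rank, hostname, port):
    """Broadcast the new listener info of a restored process to every table."""
    for table in tables:
        entry = table[rank]
        entry["addr"] = (hostname, port)   # renew listener information
        entry["handle"] = None             # invalidate: channel must be rebuilt

t0 = make_table({0: ("n0", 5000), 1: ("n1", 5001)})
t1 = make_table({0: ("n0", 5000), 1: ("n1", 5001)})
rejoin([t0, t1], 1, "n2", 6000)            # rank 1 restored on node n2
assert t0[1] == {"addr": ("n2", 6000), "handle": None}
assert t1[1]["handle"] is None             # next send re-opens the channel
```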
5.2 Atomicity of Message Transfer
Messages in MPICH-G2 are sent divided into a header and a payload. If checkpointing is performed after the header is received but before the payload is received, partial message loss may happen. We provide atomicity of message transfer in order to store and restore the communication context safely: checkpointing is not performed while a message transfer is in progress. We have made the MPI communication operations mutually exclusive with the checkpoint signal handler, as shown in Figure 9.
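The deferral pattern of Figure 9 can be reproduced with ordinary Unix signals (a runnable toy, with our own names; MPICH-GF implements this in C inside the ft-globus device):

```python
# If the checkpoint signal arrives inside a communication critical section,
# the handler only sets a flag; the checkpoint runs after the transfer ends.

import os, signal

checkpoints = []
state = {"mutex": 0, "requested": False}

def do_checkpointing():
    checkpoints.append("ckpt")

def handler(signum, frame):
    if state["mutex"] == 1:
        state["requested"] = True     # defer: a transfer is in progress
    else:
        do_checkpointing()            # safe to checkpoint immediately

signal.signal(signal.SIGUSR1, handler)

def mpi_send_like():
    state["requested"] = False
    state["mutex"] = 1
    os.kill(os.getpid(), signal.SIGUSR1)   # signal arrives mid-transfer
    # ... message transfer completes here ...
    if state["requested"]:
        do_checkpointing()                 # run the deferred checkpoint
    state["mutex"] = 0

mpi_send_like()
assert checkpoints == ["ckpt"]             # deferred, then performed once
os.kill(os.getpid(), signal.SIGUSR1)       # outside the critical section
assert checkpoints == ["ckpt", "ckpt"]     # handled immediately
```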
The mutually exclusive regions for the send and receive operations have been implemented at different levels. We set the whole send operation (MPI_Send and MPI_Isend) as a critical section. The process status in a checkpoint file should be either that no send operation has been called or that the message has been sent completely. We do not want a checkpoint file to contain send requests in the send queue, because each send request is tied to a physical channel: if a process restored its state with send-queue entries, some of them might not be sent correctly because of channel updates. The kernel does not process a non-blocking send request until a polling function such as MPI_Wait() is called explicitly in the user code. MPICH-GF therefore replaces each non-blocking send operation with a blocking one to ensure that no send request exists in a checkpoint file. This replacement does not affect the performance or correctness of the process, because the blocking operation just submits its request to the kernel; it does not wait for the delivery of the message at the receiver side.
The critical section of the receive operation is implemented at a lower level than that of the send operation. If the whole receive operation were a critical section, a deadlock could occur during coordinated checkpointing (Figure 10): a sender waits for all the processes to enter the coordination procedure while a receiver waits for the arrival of the requested message. At the receiver side, the coordination signal is delayed until the receive call finishes. The blocking receive is a combination of the non-blocking receive and a loop over the polling function, and the actual message delivery from the kernel to user memory is done by the polling. To prevent the deadlock, we set only the polling function inside the loop as a critical section. The checkpoint file may then contain receive requests in the receive queue, which does not matter because they are not related to physical channel information. To match an arrived message with a receive request, MPI's receive module checks only the sender's rank and the message tag, so a recovered process can still receive the messages corresponding to the restored receive requests. For that reason, the non-blocking receive does not need to be replaced with a blocking one.
Figure 10. Deadlock of blocking operations in coordinated checkpointing: one process blocks in Send() waiting for coordination while its peer blocks in Recv() waiting for the message, so the peer's coordination request is delayed.
6 Experimental Results
In this section, we present performance results for five NAS Parallel Benchmark applications [26] (LU, BT, CG, IS, and MG) executing on a cluster-based grid system of four Intel Pentium III 800 MHz PCs with 256 MB of memory, connected by ordinary 100 Mbps Ethernet. Linux kernel version 2.4 and Globus toolkit version 2.2 were installed. We measured the overhead of checkpointing and the cost of recovery.
Table 1 shows the characteristics of the applications. IS uses all-to-all collective operations only. MG and LU use anonymous receive operations with the MPI_ANY_SOURCE option; however, the corresponding sender for each receive call is determined statically by message tagging. LU is the most communication-intensive application, yet its message size is the smallest. Each process of CG has two or three neighbors and communicates with only one of them per iteration. The message size of CG is relatively large. BT is also a communication-intensive application: each process has six neighbors, in all directions of a 3D cube, and all of them perform non-blocking operations in every iteration.1
Application  Description                Communication  Avg. Message  Sent Messages  Executable      Avg. Checkpoint
(Class)                                 Pattern        Size (KB)     per Process    File Size (KB)  Size (MB)
BT (A)       Navier-Stokes equation     3D mesh        112.3         2440           1018.9          86.8
CG (B)       Conjugate gradient method  Chain          144.9         7990           993.2           125.9
IS (B)       Integer sort               All-to-all     2839.2        106            901.1           154.9
LU (A)       LU decomposition           Mesh           3.8           31526          1054.7          18.1
MG (A)       Multiple grid              Cube           35.4          3046           952.3           124.2

Table 1. Characteristics of the NPB applications used in the experiments
We measured the total execution time of the NPB applications using MPICH-G2 and MPICH-GF respectively. To evaluate the checkpointing overhead, we varied the checkpoint period from 10% to 50% of the total execution time. With a 10% checkpoint period, applications perform checkpointing 9 times, while only one checkpoint is taken with a 50% period. Figure 11(a) shows the total execution times of each case without any failure event. To evaluate the monitoring overhead, we measured the total execution times of MPICH-G2 and MPICH-GF without checkpointing. In this experiment, the central manager queries the local managers for their status every five seconds; if a local manager does not reply, it is assumed to have failed. The difference between the first two cases in Figure 11 indicates the monitoring overhead. The other five cases show the execution time using MPICH-GF with 1, 2, 3, 4, and 9 checkpoints performed, respectively. Figure 11(b) shows the average checkpointing overhead measured at the process. The checkpointing overhead consists of the communication overhead, the barrier overhead, and the disk overhead. We take the communication overhead to be the network delay among the hierarchical managers in a single coordinated checkpoint. We used the O_SYNC option at file open in order to guarantee that blocking disk writes are flushed physically, which is considerably time-consuming. As shown in the figure, the barrier overhead is quite small and the communication overhead is about the same for all the applications. The disk overhead is the dominant overhead for checkpointing and is almost proportional to the checkpoint file size, as expected. The barrier overhead of the IS (class B) application is exceptionally large. This is due to the in-transit messages of IS, whose sizes are the largest among all the applications (Table 1). During the barrier operation, they are pushed into the user-level memory of the receiver. Hence, it takes more time for IS to complete the barrier operation.
Incremental checkpointing and forked checkpointing techniques may be applied to minimize this disk overhead. However, we are not sure that the benefit of incremental checkpointing would be as large as expected, since all the applications used have relatively small executables (about 1 MB) while using a lot of data segment. For example, we measured the checkpoint file size of the IS application with two problem sizes, class A and class B: the checkpoint files are 41.7 MB and 155.8 MB respectively, while the executable remains the same. The difference in checkpoint file size is caused by the heap and the stack, which tend to change frequently.
In order to measure the recovery cost, we measured the time at the central manager from failure detection to the completion of the channel-update broadcast. The recovery cost per single failure is presented in Figure 12. Recovery is performed as follows: failure detection, failure broadcast, authentication of the request, and process re-launching. Among these, authentication is the most dominant factor in recovery, and it is closely related to the workings of the Globus middleware. The second most dominant overhead is process re-launching, and most of that time is spent reading the checkpoint file from disk. We note that neither of these significant overhead factors is related to scalability.
1 Collective operations are implemented by calling non-blocking P2P operations.
Figure 11. Failure-free overhead: (a) total execution time of BT(A), CG(B), IS(B), LU(A), and MG(A) under MPICH-G2, MPICH-GF without checkpointing, and MPICH-GF with checkpoint periods T = 50%, 33%, 25%, 20%, and 10%; (b) composition of the checkpointing overhead (communication, disk, and barrier overhead).
7 Conclusions

In this paper, we have presented the feasibility, architecture, and evaluation of a fault-tolerant system for grid-enabled MPICH. Our proposed system minimizes the loss of computation by periodic checkpointing and guarantees consistent recovery. It requires no modification of application source code or of the upper MPICH layer. While previous research has modified the higher layers, sacrificing performance, MPICH-GF respects the communication characteristics of MPICH through its lower-level approach. Considering the communication context at the lower level makes fine-grained checkpoint timing possible. Moreover, all of our work has been accomplished at the user level. MPICH-GF inter-operates with the Globus toolkit and can restore a failed process on any node across domains, provided the GASS service is available. We have also shown the implementation issues and the evaluation of coordinated checkpointing. At the time of writing, an implementation of independent checkpointing with message logging is under way.

The central manager is exposed as a single point of failure; we plan to use redundancy on the central manager for high availability. Our ultimate target grid system is a pile of clusters, and local failures happening inside a cluster should rather be managed within that cluster. As shown in our experimental results, the recovery cost contains authentication overhead, which is time-consuming. Since the original task request has already been authenticated, the recovery request may skip this procedure. In order to manage such local failures efficiently, a more deeply hierarchical management architecture is required. Currently, we are developing a multiple hierarchical manager system with redundancy.
References
[1] A. Agbaria and R. Friedman. Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations. In Proceedings of the IEEE Symposium on High Performance Distributed Computing, 1999.

[2] L. Alvisi, E. N. Elnozahy, S. Rao, S. A. Husain, and A. D. Mel. An analysis of communication induced checkpointing. In Symposium on Fault-Tolerant Computing, pages 242-249, 1999.
Figure 12. Recovery cost per single failure for BT(A), CG(B), IS(B), LU(A), and MG(A), decomposed into broadcasting the failure event, job re-submission, and process re-launching.
[3] L. Alvisi and K. Marzullo. Message logging: Pessimistic, optimistic, causal and optimal. IEEE Transactions on Software Engineering, 24(2):149-159, FEB 1998.

[4] R. Batchu, A. Skjellum, Z. Cui, M. Beddhu, J. P. Neelamegam, Y. Dandass, and M. Apte. MPI/FT: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing. In 1st International Symposium on Cluster Computing and the Grid, MAY 2001.

[5] A. Beguelin, E. Seligman, and P. Stephan. Application level fault tolerance in heterogeneous networks of workstations. Journal of Parallel and Distributed Computing, 43(2):147-155, 1997.

[6] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. F. Magniette, V. Néri, and A. Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In SuperComputing 2002, 2002.

[7] G. Burns, R. Daoud, and J. Vaigl. LAM: An open cluster environment for MPI. In Proceedings of Supercomputing Symposium, pages 379-386, Toronto, Canada, 1994.

[8] R. Butler and E. L. Lusk. Monitors, messages, and clusters: The p4 parallel programming system. Parallel Computing, 20(4):547-564, 1994.

[9] K. M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computing Systems, 3(1):63-75, AUG 1985.

[10] Y. Chen, K. Li, and J. S. Plank. CLIP: A checkpointing tool for message-passing parallel programs. In Proceedings of SC97: High Performance Networking & Computing, NOV 1997.

[11] I.B.M. Corporation. IBM LoadLeveler: User's guide, SEP 1993.

[12] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375-408, 2002.

[13] G. E. Fagg and J. Dongarra. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In PVM/MPI 2000, pages 346-353, 2000.

[14] M.P.I. Forum. MPI: a message passing interface standard, MAY 1994.

[15] I. Foster and N. T. Karonis. A grid-enabled MPI: Message passing in heterogeneous distributed computing systems. In Proceedings of SC98. ACM Press, 1998.

[16] I. Foster and C. Kesselman. The Globus project: A status report. In Proceedings of the Heterogeneous Computing Workshop, pages 4-18, 1998.

[17] I. Foster and C. Kesselman. The Grid: Blueprint for a Future Computing Infrastructure. Morgan Kaufmann Publishers, 1999.

[18] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. International J. Supercomputer Applications, 15(3), 2001.
[19] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke. Condor-G: A computation management agent for multi-institutional grids. In Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), AUG 2001.

[20] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. MPICH: A high-performance, portable implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6):789-828, 1996.

[21] R. Koo and S. Toueg. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering, SE-13(1):23-31, 1987.

[22] W.-J. Li and J.-J. Tsay. Checkpointing message-passing interface (MPI) parallel programs. In Pacific Rim International Symposium on Fault-Tolerant Systems (PRFTS), 1997.

[23] M. J. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. In USENIX Conference Proceedings, pages 283-290, San Francisco, CA, JAN 1992.

[24] S. Louca, N. Neophytou, A. Lachanas, and P. Evripidou. Portable fault tolerance scheme for MPI. Parallel Processing Letters, 10(4):371-382, 2000.

[25] K. Z. Meth and W. G. Tuel. Parallel checkpoint/restart without message logging. In Proceedings of the 2000 International Workshops on Parallel Processing, 2000.

[26] NASA Ames Research Center. NAS parallel benchmarks. Technical report, http://science.nas.nasa.gov/Software/NPB/, 1997.

[27] R. Netzer and J. Xu. Necessary and sufficient conditions for consistent global snapshots. IEEE Transactions on Parallel and Distributed Systems, 6(2):165-169, 1995.

[28] N. Neves and W. K. Fuchs. RENEW: A tool for fast and efficient implementation of checkpoint protocols. In Symposium on Fault-Tolerant Computing, pages 58-67, 1998.

[29] G. T. Nguyen, V. D. Tran, and M. Kotocová. Application recovery in parallel programming environment. In European PVM/MPI, pages 234-242, 2002.

[30] A. Nguyen-Tuong. Integrating Fault-Tolerance Techniques in Grid Applications. PhD thesis, University of Virginia, USA, 2000.

[31] J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent checkpointing under Unix. In USENIX Winter 1995 Technical Conference, JAN 1995.

[32] J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10):972-986, 1998.

[33] S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Symposium on Fault-Tolerant Computing, pages 48-55, 1999.

[34] S. H. Russ, J. Robinson, B. K. Flachs, and B. Heckel. The Hector distributed run-time environment. IEEE Transactions on Parallel and Distributed Systems, 9(11):1102-1114, NOV 1998.

[35] L. Silva and J. G. Silva. System-level versus user-defined checkpointing. In Symposium on Reliable Distributed Systems 1998, pages 68-74, 1998.

[36] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In Proceedings of the International Parallel Processing Symposium, pages 526-531, APR 1996.

[37] J. Tsai, S.-Y. Kuo, and Y.-M. Wang. Theoretical analysis for communication-induced checkpointing protocols with rollback dependency trackability. IEEE Transactions on Parallel and Distributed Systems, 9(10):963-971, 1998.

[38] V. Zandy, B. Miller, and M. Livny. Process hijacking. In Eighth International Symposium on High Performance Distributed Computing, pages 177-184, AUG 1999.
15
Ω Line Problem in a Consistent Global Recovery Line ∗
†MaengSoon Baik †SungJin Choi ‡JoonMin Gil ‡ChanYeol Park §HyunChang Yoo †ChongSun Hwang
†Dept. of Computer Science & Engineering Korea University, South Korea
(msbak, lotieye and hwang)@disys.korea.ac.kr
‡Supercomputing Center, Korea Institute of Science and Technology Information, South Korea
(jmgil and chan)@kisti.re.kr
§Dept. of Computer Science & Education Korea University, South Korea
Abstract
Optimistic log-based rollback recovery protocols have been regarded as an attractive fault-tolerant solution for distributed systems based on the message-passing paradigm, due to their low overhead during failure-free execution. These protocols are based on the Piecewise Deterministic (PWD) assumption model. They assume, however, that all logged non-deterministic events in a consistent global recovery line can be deterministically replayed at recovery time. In this paper, we present the impossibility of deterministically replaying a logged non-deterministic event in a consistent global recovery line as the Ω Line Problem, which arises from the asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. In addition, we propose a new optimistic log-based rollback recovery protocol that guarantees the deterministic replaying of all logged non-deterministic events belonging to a consistent global recovery line, thereby solving the Ω Line Problem at recovery time.
Keywords : Distributed Systems, Fault-tolerance, Optimistic Log-based Rollback Recovery Protocol, Piecewise
Deterministic Assumption Model, Ω Line Problem.
∗This work was supported by grant No. G-03-GS-01-04X-6 from the Korea Institute of Science and Technology Information
1 Introduction
In distributed systems, long-running distributed application programs run the risk of failing completely if even one of the processes on which they are executing fails [1]. The use of a rollback recovery protocol is an attractive method of providing fault-tolerance, since it can recover the execution of a program after a failure. During failure-free execution, state information about each process is saved on stable storage so that the process can later be restored to an earlier state of its execution [1, 5]. After a failure, if a consistent global system state can be formed from the individual processes' states, the system as a whole can be recovered, and the resulting execution of the system will be equivalent to some possible failure-free execution [2, 4, 12]. Rollback recovery protocols can be made completely transparent to the application program, avoiding special effort by the application programmer and allowing existing programs to be made fault-tolerant easily [2, 18].
Transparent rollback recovery protocols are classified into checkpoint-based rollback recovery protocols and log-based rollback recovery protocols [1, 6]. In a checkpoint-based rollback recovery protocol, each process in the system saves its state information in preparation for the occurrence of a failure; the saved state information is called a checkpoint. Checkpointing methods are classified into independent checkpointing and coordinated checkpointing [2, 9, 11, 13, 16]. Independent checkpointing minimizes failure-free overhead by letting each process take checkpoints independently, without synchronizing with other processes. Coordinated checkpointing, on the other hand, forces synchronization with other processes whenever a process takes a checkpoint. Log-based rollback recovery protocols are usually classified into optimistic and pessimistic log-based rollback recovery protocols [3, 6, 7]. In an optimistic log-based rollback recovery protocol, the determinants (the content and delivery order) of received messages are logged in volatile memory before processing and written to stable storage at a time convenient for the process [4, 6, 8]. A pessimistic log-based rollback recovery protocol, on the other hand, can recover up to the state immediately before a failure by logging the determinants of all nondeterministic events on stable storage before processing [6, 14, 17].
Log-based rollback recovery protocols are popular for building fault-tolerant systems that can tolerate process crash failures, since they guarantee that all logged non-deterministic events in a consistent global recovery line can be deterministically replayed beyond the most recent checkpoint [2, 3]. These protocols require that each process periodically records its local state and logs the messages it received after recording that state. When a process crashes, a new process is created in its place; the new process is given the appropriate recorded local state and is then sent the logged messages in the order they were originally received [5]. Thus, log-based rollback recovery protocols implement the abstraction of a resilient process, in which the crash of a process is translated into intermittent unavailability of that process.
All log-based rollback recovery protocols require that once a crashed process recovers, its state be consistent with the states of the other processes [6]. This consistency requirement is usually expressed in terms of orphan processes, which are surviving processes whose states are inconsistent with the recovered state of a crashed process. Thus, log-based rollback recovery protocols guarantee that no process is an orphan at recovery time.
Previous optimistic log-based rollback recovery protocols assumed that all logged non-deterministic events in a consistent global recovery line can be deterministically replayed. We observe, however, that some logged non-deterministic events in a consistent global recovery line cannot be deterministically replayed at recovery time, because of the asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. As a result, some process may become an orphan process at recovery time even though no process is an orphan in the consistent global recovery line.

In this paper, we present the impossibility of deterministically replaying a logged non-deterministic event in a consistent global recovery line as the Ω Line Problem. In addition, we propose a new optimistic log-based rollback recovery protocol that guarantees the deterministic replaying of all logged non-deterministic events in a consistent global recovery line when a failure occurs, thereby solving the Ω Line Problem at recovery time.
The rest of this paper is organized as follows. Section 2 presents the system model assumed in this paper. Section 3 presents the impossibility of deterministically replaying a logged non-deterministic event in a consistent global recovery line at recovery time. Section 4 describes a new optimistic log-based rollback recovery protocol that guarantees the deterministic replaying of all logged non-deterministic events in a consistent global recovery line at recovery time. Finally, we conclude in Section 5.
2 System Models
The system assumed in this paper is a distributed system consisting of processes [12]. Processes communicate with each other only by exchanging messages. The distributed system is asynchronous: there is no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source [12].
Figure 1. The execution of processes in systems based on PWD model
Processes in the system execute events, including message sending events, message receiving events, and local events. These events are ordered by an irreflexive partial order, the happens-before relation, which represents potential causality [1]. We assume that the channels between processes are reliable and do not reorder messages. Beyond satisfying the happens-before relation, the exact order in which a process executes message receiving events depends on other factors, including process scheduling, routing, and flow control. Thus, after a recovery is successfully completed, a recovering process may not produce the same result as before, even if the same set of messages is sent to it, since the messages may not be redelivered in the same sequence as before the failure. Previous rollback recovery protocols represent this non-determinism by a non-deterministic local event, message delivery, which corresponds to handing a received message to the application. The only constraint is that a process eventually delivers all messages that it receives.
The execution of each process in the system is based on the PWD model. Under the PWD model, the execution of a process is determined by its initial state and by the delivery order and content of its received messages. Thus, if the delivery order and content of received messages are preserved, the execution of the process can be deterministically replayed [2, 4]. In Figure 1, the whole execution of process $p_1$ is determined by its initial state and by the delivery orders and contents of received messages $m_0$, $m_3$ and $m_7$. The formal representation of a process's execution under the PWD model is as follows [2].
$$Ex(p_i) = Init(p_i) + \biguplus_{j \in I} s^j_{p_i}(m_k) \qquad (1)$$

$Ex(p_i)$ : the whole execution state of process $p_i$
$Init(p_i)$ : the initial state of process $p_i$
$+$ : the sequential work flow of a process
$\biguplus_{j \in I}$ : the set of sequential work flows of a process, in ascending order of $j$
$I$ : a state interval number
$s^j_{p_i}(m_k)$ : the $j$-th state interval of process $p_i$, started by message $m_k$
A state interval of a process is started by the delivery of a received message and terminated by the delivery of the next received message. In Figure 1, the state interval $s^1_{p_2}$ of process $p_2$ is started by the delivery of message $m_1$ received from process $p_1$ and terminated by the delivery of message $m_5$ received from process $p_1$. In addition, message sending events within the same state interval are executed deterministically. In Figure 1, the three sending events of messages $m_2$, $m_3$ and $m_7$ are executed deterministically in state interval $s^1_{p_2}$. The whole execution state of a process is a sequential work flow of state intervals. In Figure 1, the whole execution state of process $p_1$ is composed of the initial state $s^0_{p_1}$ and the sequential set of state intervals $s^1_{p_1}$, $s^2_{p_1}$ and $s^3_{p_1}$. The formal representation of the whole execution state of process $p_1$ is as follows.
$$Ex(p_1) = s^0_{p_1} \uplus s^1_{p_1}(m_0) \uplus s^2_{p_1}(m_3) \uplus s^3_{p_1}(m_7) \qquad (2)$$
Processes in systems can fail independently according to the fail-stop model. In this model, processes fail by
halting, and the fact that a process has halted is eventually detectable by all surviving processes.
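The PWD guarantee can be illustrated with a small sketch: replaying a process from its initial state with the same logged delivery order reproduces the same execution, while a different delivery order may not. All names and the state representation below are illustrative assumptions, not the paper's notation.

```python
# Sketch of piecewise-deterministic replay: a process's execution is fully
# determined by its initial state plus the ordered contents of the messages
# delivered to it (the state representation here is an illustrative choice).

def execute(state, message):
    # Deterministic transition: each delivered message starts a new state
    # interval whose outcome depends only on the prior state and the message.
    return state + [message]

def replay(initial_state, delivery_log):
    """Re-run a process from its initial state using the logged
    determinants (delivery order and message contents)."""
    state = list(initial_state)
    for message in delivery_log:
        state = execute(state, message)
    return state

# Two replays with the same log yield the same final state...
log = ["m0", "m3", "m7"]
assert replay([], log) == replay([], log)

# ...but a different delivery order yields a different execution.
assert replay([], ["m3", "m0", "m7"]) != replay([], log)
```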
3 Impossibility of Deterministic Replaying of Logged Non-deterministic Event in a
Consistent Global Recovery Line
3.1 Previous consistency condition in optimistic log-based rollback recovery protocol
In optimistic log-based rollback recovery protocols, previous studies assumed that the execution of each process follows the piecewise deterministic assumption model. In this model, the execution of each process is assumed to take place as a series of state intervals in which events take place; an event may be the execution of an instruction, the sending of a message, and so on. Each state interval in the piecewise deterministic assumption model is assumed to start with a non-deterministic event, such as the receipt of a message; from that moment on, the execution of the process is completely deterministic. A state interval ends with the last event before the next non-deterministic event occurs.

In effect, a state interval can be replayed with a known result, that is, in a completely deterministic way, provided it is replayed starting from the same non-deterministic event as before. Consequently, if a rollback recovery protocol records all non-deterministic events in such a model, it becomes possible to replay the entire execution of a process in a completely deterministic way.
In optimistic log-based rollback recovery protocols, the main issue is keeping the system state consistent after the occurrence of a failure [4, 6, 10]. For the system state to be consistent, no message may be an orphan message and no process may be an orphan process at recovery time. When the delivery event of a message is recorded at the receiver process but the sending event of the message is not recorded at the sender process, the message becomes an orphan message and its receiver process becomes an orphan process at recovery time. In log-based rollback recovery protocols, an orphan message and an orphan process are defined as follows.
Definition 1 (Orphan Message) If $\exists m$ such that $DE(m) \in Log(s^k_{p_i}(m))$ and $SE(m) \in UnLog(s^l_{p_j}(m'))$, where $p_i$ is the receiver process and $p_j$ is the sender process of message $m$, and $UnLog(s^l_{p_j}(m'))$ is lost due to a failure, then message $m$ becomes an orphan message, $\psi(m)$, at recovery time.
Definition 2 (Orphan Process) If the following two conditions are satisfied at recovery time, then process $p_i$ becomes an orphan process, $\psi(p)$.

• (Before failure): $\exists m$ such that $DE(m) \in Log(s^k_{p_i}(m))$ and $SE(m) \in UnLog(s^l_{p_j}(m'))$, where $p_i$ is the receiver process and $p_j$ is the sender process of message $m$.

• (After failure): $UnLog(s^l_{p_j}(m'))$, saved in volatile memory at process $p_j$, is lost due to the failure.
In Definition 1 and Definition 2, $s^k_{p_i}(m)$ denotes the $k$-th state interval of process $p_i$, started by the delivery event of message $m$. $DE(m)$ denotes the delivery event of message $m$, and $SE(m)$ denotes the sending event of message $m$. $Log(s^k_{p_i}(m))$ means that the determinant (message content and delivery order) of message $m$ is logged on stable storage at process $p_i$. $s^l_{p_j}(m')$ denotes the $l$-th state interval of process $p_j$, started by message $m'$. $UnLog(s^l_{p_j}(m'))$ means that the determinant of message $m'$ is not logged on stable storage at process $p_j$ and may be lost due to a failure. In Figure 1, message $m_6$ is not an orphan message, because the determinant of message $m_3$ is logged on stable storage at process $p_1$. However, message $m_8$ may become an orphan message, because the determinant of message $m_5$ is not logged on stable storage at process $p_1$.
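Definitions 1 and 2 can be phrased as a simple predicate. The sketch below uses an illustrative data layout (plain sets of logged and lost sending/delivery events); it is not part of the protocol itself.

```python
# Sketch of the orphan-message test from Definitions 1 and 2: message m is an
# orphan if its delivery is logged at the receiver, while the sender's state
# interval that sent m was never logged to stable storage and was lost in the
# failure. The set-based layout is an illustrative assumption.

def is_orphan_message(m, receiver_logged, sender_logged, sender_lost):
    """receiver_logged: messages whose delivery event the receiver logged on
    stable storage; sender_logged: messages whose sending interval the sender
    logged; sender_lost: messages whose unlogged sending interval was lost."""
    return (m in receiver_logged
            and m not in sender_logged
            and m in sender_lost)

# Figure 1 example (hypothetical encoding): m6 is not an orphan because its
# sending interval is logged at p1; m8 may be, because its sending interval
# was only in volatile memory at p1 and was lost.
receiver_logged = {"m6", "m8"}
p1_logged_sends = {"m6"}
p1_lost_sends = {"m8"}
assert not is_orphan_message("m6", receiver_logged, p1_logged_sends, p1_lost_sends)
assert is_orphan_message("m8", receiver_logged, p1_logged_sends, p1_lost_sends)
```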
In log-based rollback recovery protocols, when a failure occurs at any process, the set of state intervals of the processes is called a recovery line. The definition of a recovery line is as follows.
Definition 3 (Recovery Line) When a failure occurs at any process, the set of state intervals of the processes $\bigcup^{\aleph}_{i=0} s^k_{p_i}$ is a recovery line, where $k$ is a state interval number of each process and $\aleph$ is the number of processes in the system.
In addition, a recovery line containing an orphan message and an orphan process is called an inconsistent recovery line, and a recovery line containing no orphan message and no orphan process is called a consistent global recovery line [3, 15]. The following two definitions define an inconsistent recovery line and a consistent global recovery line, respectively.
Definition 4 (Inconsistent Recovery Line) When a failure occurs at any process, if $\exists m$ such that $DE(m) \in Log(s^k_{p_i}(m))$ and $SE(m) \in UnLog(s^l_{p_j}(m'))$, where $p_i$ is the receiver process and $p_j$ is the sender process of message $m$, $s^k_{p_i}(m)$ and $s^l_{p_j}(m') \in \bigcup^{\aleph}_{i=0} s^k_{p_i}$, and $UnLog(s^l_{p_j}(m'))$, saved in volatile memory at process $p_j$, is lost due to the failure, then $\bigcup^{\aleph}_{i=0} s^k_{p_i}$ is an inconsistent recovery line.
Definition 5 (Consistent Global Recovery Line) When a failure occurs at any process, if $\nexists m, \forall i, j$ such that $DE(m) \in Log(s^k_{p_i}(m))$ and $SE(m) \in UnLog(s^l_{p_j}(m'))$, where $p_i$ is the receiver process and $p_j$ is the sender process of message $m$, $s^k_{p_i}(m)$ and $s^l_{p_j}(m') \in \bigcup^{\aleph}_{i=0} s^k_{p_i}$, and $UnLog(s^l_{p_j}(m'))$, saved in volatile memory at process $p_j$, is lost due to the failure, then $\bigcup^{\aleph}_{i=0} s^k_{p_i}$ is a consistent global recovery line.
In Figure 2, when a failure occurs at state interval $s^3_{p_1}$ and $UnLog(s^3_{p_1}(m_7))$ is lost due to the failure, the set of state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$ is a consistent global recovery line, because it contains no orphan message and no orphan process. However, the set of state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^3_{p_2}$ is an inconsistent recovery line, because message $m_8$ becomes an orphan message and process $p_2$ becomes an orphan process at recovery time.
Figure 2. Two recovery lines in optimistic log-based rollback recovery protocol
Previous optimistic log-based rollback recovery protocols all assumed that the logged non-deterministic events in a consistent global recovery line can be deterministically replayed at recovery time. In Figure 2, if a failure occurs at state interval $s^3_{p_1}$ of process $p_1$ after message $m_6$ is sent to process $p_0$, and the consistent global recovery line consists of state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$, then process $p_1$ should be able to deterministically replay the sending event of message $m_6$ at state interval $s^2_{p_1}$ during recovery. However, we observe that message $m_6$ may not be deterministically re-sent to process $p_0$ at state interval $s^2_{p_1}$ during recovery, because of the asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. In the next subsection, we show the impossibility of deterministically replaying a logged non-deterministic event in a consistent global recovery line.
3.2 Impossibility of deterministic replaying of logged non-deterministic event in a consistent global recovery line
In asynchronous distributed systems, there is no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. The absence of a bound on the relative speeds of processes means that the execution of processes at recovery time may differ from the execution before the failure. The absence of a bound on message transmission delays means that the order in which messages are received at recovery time may differ from the order before the failure. Finally, the absence of a global time source means that concurrent events in different processes cannot be coordinated without exchanging messages. Previous optimistic log-based rollback recovery protocols all assumed that message sending events within the same state interval can be deterministically replayed at recovery time if the state interval is contained in a consistent global recovery line; thus, a message sent in a state interval whose determinant is logged on stable storage and which is contained in a consistent global recovery line cannot be an orphan message. In Figure 2, message $m_6$ is not an orphan message in the consistent global recovery line $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$, because the determinant of state interval $s^2_{p_1}$ is logged on stable storage at process $p_1$, and process $p_0$ is not an orphan process at recovery time.
To show how previous optimistic log-based rollback recovery protocols determine a consistent global recovery line after the occurrence of a failure, assume in Figure 2 that a failure occurs in the fourth state interval $s^3_{p_1}$ of process $p_1$, and that the determinants of state intervals $s^1_{p_1}$ and $s^2_{p_1}$ are logged on stable storage, but the determinant of state interval $s^3_{p_1}$ is not logged and is lost due to the failure. At recovery time, the system determines a consistent global recovery line consisting of state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$, and processes $p_0$, $p_1$ and $p_2$ deterministically replay up to state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$, respectively. In previous optimistic log-based rollback protocols, the standard method for determining a consistent global recovery line uses the transitive dependency vector. In this method, each process piggybacks the transitive dependency vector maintained for its current state interval on every message it sends. Before delivering a message, the receiver updates its transitive dependency vector by comparing it to the vector piggybacked on the message. The update procedure for the transitive dependency vector is as follows [1, 2].
$$Update(SN(p_0), \ldots, SN(p_i), \ldots, SN(p_\aleph))_{p_i} = (\max(SN(p_0)_{p_i}, SN(p_0)_m), \ldots, SN(p_i)_{p_i} + 1, \ldots, \max(SN(p_\aleph)_{p_i}, SN(p_\aleph)_m)) \qquad (3)$$
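The update of equation (3) amounts to an element-wise maximum of the local and piggybacked vectors, followed by incrementing the delivering process's own entry. A minimal sketch, with an illustrative data layout (plain lists of state-interval numbers):

```python
# Sketch of the transitive dependency vector update (equation (3)): on
# delivery at process p_i, take the element-wise maximum of the local vector
# and the vector piggybacked on the message, then advance p_i's own entry,
# since the delivery starts a new state interval at p_i.

def update_dependency_vector(local, piggybacked, i):
    """local, piggybacked: lists of state-interval numbers SN(p_0..p_N);
    i: index of the delivering process."""
    merged = [max(a, b) for a, b in zip(local, piggybacked)]
    merged[i] = local[i] + 1   # delivery starts a new state interval at p_i
    return merged

# Hypothetical example: a process p_0 in interval (2, 2, 1) delivers a
# message carrying (0, 2, 1); its vector advances to (3, 2, 1).
assert update_dependency_vector([2, 2, 1], [0, 2, 1], 0) == [3, 2, 1]
```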
When a failure occurs, the system automatically detects it and restarts the failed process. The restarted process determines its last logged state interval number and broadcasts the occurrence of its failure to all processes in the system. Each process that receives this broadcast message determines its last state interval consistent with the last logged state interval of the restarted process. This procedure continues until a consistent global recovery line is determined. To determine whether a recovery line is consistent or inconsistent, previous optimistic log-based rollback recovery protocols construct a dependency matrix. In Figure 2, the recovery line represented by the solid line, consisting of state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$, is represented by the following dependency matrix.
$$\begin{pmatrix} [3] & 2 & 1 \\ 0 & [2] & 1 \\ 0 & 2 & [2] \end{pmatrix} \qquad (4)$$
A matrix representing a consistent global recovery line must satisfy the property that every diagonal element is greater than or equal to the other elements in its column. In Figure 2, the inconsistent recovery line consisting of state intervals $s^3_{p_0}$, $s^3_{p_1}$ and $s^3_{p_2}$, represented by the dotted line, has the following dependency matrix.
$$\begin{pmatrix} [3] & 2 & 1 \\ 0 & [2] & 1 \\ 0 & 3 & [3] \end{pmatrix} \qquad (5)$$
In this matrix, the (3, 2) element is greater than the (2, 2) element. That is, state interval $s^3_{p_2}$ of process $p_2$ depends on state interval $s^3_{p_1}$, but $s^3_{p_1}$ is lost due to the failure, so process $p_2$ becomes an orphan process.
is lost due to a failure and process p2 comes to be an orphan process.
However, we detect that deterministic message sending event in consistent global recovery line may not be
replayed in recovery time. To present this situation, we assumed that a failure is occurred in a state interval s3p1
in process p1 and the logging information for determinant of message m7 is lost due to a failure in Figure 3. In
recovery time, systems determines a consistent global recovery line as a set of state intervals s3p0
, s2p1
and s2p2
and
these state intervals consists a followed dependency matrix.
$$\begin{pmatrix} [3] & 2 & 1 \\ 0 & [2] & 1 \\ 0 & 2 & [2] \end{pmatrix} \qquad (6)$$
In this dependency matrix, every diagonal element is greater than or equal to the other elements in its column. However, suppose that during recovery process $p_1$ receives message $m_7$ from process $p_2$ after it has re-sent messages $m_4$ and $m_5$ to processes $p_0$ and $p_2$, but before it has re-sent message $m_6$ to process $p_0$. Then process $p_1$ cannot determine whether it may deliver message $m_7$ or not. If message $m_7$ arrives before process $p_1$ has re-sent messages $m_4$ and $m_5$, process $p_1$ first sends $m_4$ and $m_5$, because it knows from the dependency matrix representing the consistent global recovery line (the (1, 2) and (3, 2) elements) that processes $p_0$ and $p_2$ depend on its state interval $s^2_{p_1}$. However, using only the dependency matrix representing the consistent global recovery line, $p_1$ cannot know that both state intervals $s^2_{p_0}$ and $s^3_{p_0}$ depend on its state interval $s^2_{p_1}$. Thus, it may deliver message $m_7$, re-sent by process $p_2$, during recovery; delivering $m_7$ before sending message $m_6$ to process $p_0$ makes it impossible for process $p_1$ to deterministically re-send message $m_6$ from state interval $s^2_{p_1}$. As a result, message $m_6$ becomes an orphan message and a process becomes an orphan process in the consistent global recovery line $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$ at recovery time. Although the non-deterministic message delivery event $DE(m_6)$ belongs to a consistent global recovery line, there is no guarantee that it can be deterministically replayed at recovery time. To summarize this situation, we distinguish three cases in which process $p_1$ delivers message $m_7$ during recovery.
• Message $m_7$ arrives at state interval $s^2_{p_1}$ before process $p_1$ sends message $m_4$ to process $p_0$: Process $p_1$ delays the delivery of message $m_7$ until it has sent messages $m_4$ and $m_5$ to processes $p_0$ and $p_2$, respectively, because it knows from the dependency matrix representing the consistent global recovery line ($s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$) that processes $p_0$ and $p_2$ depend on its state interval $s^2_{p_1}$. However, process $p_1$ cannot determine whether it may deliver message $m_7$ before sending message $m_6$ to process $p_0$, because it does not know that both state intervals $s^2_{p_0}$ and $s^3_{p_0}$ depend on its state interval $s^2_{p_1}$.
• Message $m_7$ arrives at state interval $s^2_{p_1}$ after process $p_1$ has sent message $m_4$ to process $p_0$ but before it has sent message $m_5$ to process $p_2$: Process $p_1$ delays the delivery of message $m_7$ until it has sent message $m_5$ to $p_2$, because it knows from the dependency matrix representing the consistent global recovery line ($s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$) that process $p_2$ depends on its state interval $s^2_{p_1}$. However, process $p_1$ cannot determine whether it may deliver message $m_7$ before sending message $m_6$ to process $p_0$, because it does not know that both state intervals $s^2_{p_0}$ and $s^3_{p_0}$ depend on its state interval $s^2_{p_1}$.
• Message $m_7$ arrives at state interval $s^2_{p_1}$ after process $p_1$ has sent messages $m_4$ and $m_5$ to processes $p_0$ and $p_2$, respectively: Process $p_1$ cannot determine whether it may deliver message $m_7$ before sending message $m_6$ to process $p_0$, because it does not know that both state intervals $s^2_{p_0}$ and $s^3_{p_0}$ depend on its state interval $s^2_{p_1}$.
In distributed systems, there is no guarantee that the order in which messages are received at recovery time is the same as the order in which they were received during failure-free execution, due to the asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. If the last state interval of some process depends on two state intervals of the same process in a consistent global recovery line, and that process receives a message before it re-sends the second of those messages at recovery time, the process cannot determine
Figure 3. Ω Line Problem in optimistic log-based rollback recovery protocol
whether it may deliver this message or not. If the process delivers this message before sending its second message, then a regular message becomes an orphan message and a regular process becomes an orphan process in a consistent global recovery line. In the next subsection, we define this situation as the Ω Line Problem.
3.3 Ω Line problem in a consistent global recovery line
In the previous subsection, we showed the impossibility of deterministically replaying a logged non-deterministic event in a consistent global recovery line. We define this situation as the Ω Line Problem. The following definition states the Ω Line Problem.
Definition 6 (Ω Line Problem) If process $p_i$ performs the following executions at the occurrence of a failure and at recovery time, then these executions create the Ω Line Problem.

• (Occurrence of a failure): Before the occurrence of the failure, process $p_i$ sends two or more messages, $\sum^{l}_{k=1} m_k$ ($l \geq 2$), to process $p_j$ at a state interval $s^{last}_{p_i}$ that belongs to the consistent global recovery line $\bigcup^{\aleph}_{i=1} s^{l}_{p_i}$ as the last state interval of process $p_i$ after the failure occurs.

• (Recovery time): Process $p_i$ receives a message, $m_\Omega$, in its last state interval $s^{last}_{p_i}$ in the consistent global recovery line $\bigcup^{\aleph}_{i=1} s^{l}_{p_i}$ before it sends the messages $\sum^{l}_{k=1} m_k$ ($l \geq 2$) to process $p_j$, and the last state interval $s_{p_j}(m_l)$ belongs to the consistent global recovery line $\bigcup^{\aleph}_{i=1} s^{l}_{p_i}$.
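The conditions of Definition 6 can be phrased as a predicate. The sketch below is an illustrative simplification, not the paper's detection mechanism: it abstracts the recovery line to the list of messages $p_i$ sent to $p_j$ from its last interval and a count of how many have been re-sent so far.

```python
# Sketch of the Ω Line condition from Definition 6 (data layout is an
# illustrative assumption): the problem arises when a process sent two or
# more messages to the same destination from its last state interval in the
# consistent global recovery line, and during recovery a message arrives in
# that interval before all of those sends have been replayed.

def creates_omega_line_problem(sends_in_last_interval, replayed_sends,
                               message_arrived_during_replay):
    """sends_in_last_interval: messages p_i sent to p_j from s^last before
    the failure; replayed_sends: how many have been re-sent in recovery;
    message_arrived_during_replay: whether some m_Ω arrived meanwhile."""
    return (len(sends_in_last_interval) >= 2
            and replayed_sends < len(sends_in_last_interval)
            and message_arrived_during_replay)

# Figure 3 scenario (hypothetical encoding): p1 sent m4 and m6 to p0 from
# s2_p1, and m7 arrives after m4 is replayed but before m6 is:
assert creates_omega_line_problem(["m4", "m6"], 1, True)
# Once all sends are replayed, delivering the late message is safe:
assert not creates_omega_line_problem(["m4", "m6"], 2, True)
```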
4 New Optimistic Log-based Rollback Recovery Protocol
In this section, we propose a new optimistic log-based rollback recovery protocol that guarantees the deterministic replaying of all logged non-deterministic events in a consistent global recovery line and solves the Ω Line Problem.
4.1 Data structure
The data structures maintained by process $p_i$ for recovery from a failure are as follows.
• $DeV^{k}_{p_i} = (SN(p_0), SN(p_1), \cdots, k, SN(p_{i+1}), \cdots, SN(p_\aleph))$: Process $p_i$ maintains the transitive dependency vector $DeV^{k}_{p_i}$ in order to identify the state intervals of other processes that have a dependency on its $k$-th state interval [13]. Process $p_i$ piggybacks $DeV^{k}_{p_i}$ on every message sent from its $k$-th state interval $s^{k}_{p_i}$. In addition, whenever process $p_i$ delivers a message, it computes the new transitive dependency vector $DeV^{k+1}_{p_i}$ by comparing its $DeV^{k}_{p_i}$ with the transitive dependency vector piggybacked on the message. This update procedure is presented in Figure 4.

• $\varepsilon^{\upsilon}_{p_i}$: Process $p_i$ piggybacks the direct dependency counting number $\varepsilon^{\upsilon}_{p_i}$ on every message sent to other processes within a state interval. $\varepsilon^{\upsilon}_{p_i}$ is the number of messages sent to the same process within one state interval.

• $ICN_{p_i}$: Process $p_i$ maintains the variable $ICN_{p_i}$, which records the number of failure occurrences in the system.
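A possible in-memory layout for these structures is sketched below. Class and field names are illustrative assumptions, and the ε bookkeeping is simplified (every send to a destination is counted, and counters reset when a new state interval starts):

```python
# Sketch of the per-process state from Section 4.1 (names are illustrative):
# the transitive dependency vector DeV, the per-destination direct dependency
# counter ε, and the failure-occurrence counter ICN.

class ProcessState:
    def __init__(self, num_processes, pid):
        self.pid = pid
        self.dev = [0] * num_processes   # DeV^k_{p_i}
        self.epsilon = {}                # ε per destination, per interval
        self.icn = 0                     # number of failures observed

    def on_send(self, dest):
        """Count repeated sends to the same destination within one state
        interval; return the values piggybacked on the outgoing message."""
        self.epsilon[dest] = self.epsilon.get(dest, 0) + 1
        return self.dev[:], self.epsilon[dest]

    def on_deliver(self, piggybacked_dev):
        """Merge dependency vectors and open a new state interval
        (the ε counters reset to 1, as in Figure 4)."""
        self.dev = [max(a, b) for a, b in zip(self.dev, piggybacked_dev)]
        self.dev[self.pid] += 1
        self.epsilon = {dest: 1 for dest in self.epsilon}

p = ProcessState(3, 1)
_, eps = p.on_send(0)
assert eps == 1
_, eps = p.on_send(0)
assert eps == 2            # second message to p0 in the same interval
p.on_deliver([1, 0, 0])
assert p.dev == [1, 1, 0]  # merged, then own entry advanced
assert p.epsilon[0] == 1   # reset when a new interval starts
```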
4.2 Algorithm
The new optimistic log-based rollback recovery protocol proposed in this paper consists of two algorithms: an algorithm for failure-free time and an algorithm for recovery time.
4.2.1 Algorithm in failure-free time
In failure-free time, process $p_i$ piggybacks its transitive dependency vector on every message it sends. Whenever process $p_i$ delivers a message from another process, it updates its transitive dependency vector by comparing it to the vector piggybacked on the received message before delivering it. If state intervals of a process $p_j$ depend on a state interval of process $p_i$, then process $p_i$ piggybacks the variable $\varepsilon^{\upsilon}_{p_i}$ on the messages sent from that state interval. The variable $\varepsilon^{\upsilon}_{p_i}$ increases with the number of messages sent within a state interval; when the state interval advances to the next state interval, $\varepsilon^{\upsilon}_{p_i}$ is reset to 1. The algorithm executed by process $p_i$ in failure-free time is presented in Figure 4.
——————————————————————————————————————————————————
Do failure-free execution of process pi;
  set DeV^0_pi = (0, 0, · · · , 0);
  Whenever process pi receives a message mr from process pj (i ≠ j) at its state interval s^k_pi;
    update DeV^k_pi before delivering message mr;
    DeV^{k+1}_pi = (Max(SN(p0)_pi, SN(p0)_mr), · · · , k + 1, · · · , Max(SN(pℵ)_pi, SN(pℵ)_mr));
    translate s^k_pi into s^{k+1}_pi;
  Whenever process pi sends message mθ to another process pj at state interval s^k_pi;
    piggyback DeV^k_pi on message mθ;
    if ∃ mθ−1 sent from state interval s^k_pi to process pj then;
      ευ_pi(pj)++;
      piggyback ευ_pi(pj) on message mθ;
      if s^k_pi transits to s^{k+1}_pi then;
        ευ_pi(pj) = 1;
      fi;
    fi;
oD;
——————————————————————————————————————————————————
Figure 4. Algorithm in failure-free time in process pi
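The bookkeeping in Figure 4 can be sketched in a few lines of code: the dependency vector is merged by element-wise maximum on delivery, and the ευ counter tracks how many messages went to each peer in the current state interval. The class and names below are illustrative, not part of the protocol specification.

```python
# Sketch of the failure-free bookkeeping from Figure 4 (names illustrative).

class Process:
    def __init__(self, pid, n_procs):
        self.pid = pid
        # Transitive dependency vector DeV: one sequence number per process.
        self.dev = [0] * n_procs
        # Direct dependency counters (the epsilon-upsilon counter): messages
        # sent to each peer in the current state interval.
        self.sent_in_interval = {}

    def on_deliver(self, piggybacked_dev):
        """Element-wise max with the piggybacked vector, then advance this
        process's own state interval (DeV[pid] plays the role of k)."""
        self.dev = [max(a, b) for a, b in zip(self.dev, piggybacked_dev)]
        self.dev[self.pid] += 1            # s^k -> s^{k+1}
        self.sent_in_interval.clear()      # counters reset in the new interval

    def on_send(self, dest):
        """Return the data piggybacked on a message to `dest`."""
        count = self.sent_in_interval.get(dest, 0) + 1
        self.sent_in_interval[dest] = count
        return list(self.dev), count       # (DeV^k_pi, epsilon-upsilon)

p0, p1 = Process(0, 2), Process(1, 2)
dev, eps = p0.on_send(dest=1)      # first message in the interval: eps == 1
dev, eps = p0.on_send(dest=1)      # second message, same interval: eps == 2
p1.on_deliver(dev)                 # p1 merges p0's vector and advances
```

Note that the counter reset on interval advance corresponds to the `ευ_pi(pj) = 1` step of Figure 4.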
4.2.2 Algorithm in recovery time
If a failure occurs in any process, the system automatically detects it and starts the recovery algorithm on all processes. First, the process restarted by the system after the failure determines the index number, ζ, of the state interval last logged on stable storage by the failed process. The failed process broadcasts a failure-occurrence message containing this index number ζ to all processes in the system. When the other processes receive this broadcast message, mζ, they determine their maximum transitive dependency vectors consistent with state interval s^ζ_pfailed and piggyback these vectors on the reply messages. After the failed process receives, from all processes in the system, the reply messages carrying each sender's maximum transitive dependency vector consistent with s^ζ_pfailed, it builds a dependency matrix and determines whether the set of state intervals composing the matrix is consistent. If the dependency matrix is not consistent and does not form a consistent global recovery line, the failed process requests previous transitive dependency vectors from any process that would become an orphan in the current recovery line, until it determines a consistent global recovery line. After the restarted process determines a consistent global recovery line, it broadcasts a replay message containing the last state interval of each process in the consistent global recovery line. While each process replays events up to the consistent global recovery line, it piggybacks its incremented ICN number on every message it sends. During this replaying execution, each process compares its ICN number with the ICN number piggybacked on each received message. If the piggybacked ICN number is lower than its own ICN number, the process discards the message; otherwise, it delivers messages according to the consistent global recovery line. The algorithm of process pi in recovery time is presented in Figure 5.
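The consistency test on the dependency matrix can be sketched as follows. Row i is the transitive dependency vector reported by process pi, and the check below mirrors the condition as Figure 5 states it (a_ii ≥ a_ik for every row i and column k); the function name is ours.

```python
# Sketch of the recovery-line consistency test on MTX(dep): the candidate
# recovery line is accepted only when each process's own entry (the diagonal)
# dominates every other entry in its row.

def is_consistent_recovery_line(mtx_dep):
    return all(
        row[i] >= row[k]
        for i, row in enumerate(mtx_dep)
        for k in range(len(row))
    )

# Two processes: each one's own entry dominates its row -> consistent.
is_consistent_recovery_line([[2, 1], [1, 3]])   # True
# p0 reports a dependency on p1's interval 5 beyond its own interval 1
# -> not a consistent recovery line.
is_consistent_recovery_line([[1, 5], [0, 3]])   # False
```

When the test fails, the protocol's reaction (per the text above) is to request earlier transitive dependency vectors from the offending processes and re-run the test.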
4.3 Proof of Correctness
The correctness of the proposed optimistic log-based rollback recovery protocol is proved in Theorem 1.
Theorem 1 Suppose there exists ω ≥ 0 such that a state interval s^k_pi of process pi belongs to a consistent global recovery line, process pi sends messages mψ through mψ+ω to process pj, and the state interval s^η_pj(mψ+ω) of process pj belongs to the consistent global recovery line. Then the proposed recovery protocol, Υ, guarantees the deterministic replaying of all non-deterministic events in the consistent global recovery line and solves the Ω Line Problem.
Proof To prove Theorem 1, we consider three cases in recovery time and show that the proposed recovery protocol, Υ, always solves the Ω Line Problem.
Case 1: ω = 0 - If ω equals 0, there is no Ω Line Problem in the consistent global recovery line. In this case, the proposed protocol, Υ, is the same as previous optimistic log-based rollback recovery protocols. After a failure occurs, Υ constructs a transitive dependency matrix, MTX(dep), and determines a consistent global recovery line. Each process in the system deterministically replays its non-deterministic events up to the consistent
——————————————————————————————————————————————————
When a failure occurs in process pi;
System restarts the failed process pi;
Do
  determine the last state interval index number, ζ, logged on stable storage in process pi before the failure;
  broadcast failure-message(pi, ζ) to all processes in the system;
  wait for reply-message(pj, DeV^λ_pj, ευ_pi(pj)) from all processes in the system;
  continue;
  if process pi receives reply-messages from all processes then
    make the transitive dependency matrix MTX(dep) from ∑_{k=0}^ℵ DeV_pk, such as

      a0,0(p0)         a0,1(p0)         · · ·   a0,ℵ−1(p0)
      a1,0(p1)         a1,1(p1)         · · ·   a1,ℵ−1(p1)
      ...              ...                      ...
      aℵ−1,0(pℵ−1)     aℵ−1,1(pℵ−1)     · · ·   aℵ−1,ℵ−1(pℵ−1);

    if ∀i (i = 0, 1, · · · , ℵ − 1): ai,i ≥ ai,k in MTX(dep) then
      determine that MTX(dep) is a consistent global recovery line;
    fi;
  else
    send request-message(previous transitive dependency vector) to processes having an
    inconsistent relation with other processes;
until process pi constructs a transitive dependency vector representing a consistent global recovery line;
oD;

In replaying non-deterministic events in a consistent global recovery line;
Do;
  if ∃j: ευ_pi(pj) > 1 in the last state interval belonging to a consistent global recovery line then
    delay process pi from delivering a new message mnew
    until process pi sends the ευ_pi(pj)-th message to process pj;
  fi;
oD;
——————————————————————————————————————————————————
Figure 5. Algorithm in recovery time in process pi
global recovery line. Because there is no Ω Line Problem in the consistent global recovery line, all non-deterministic events in it can be deterministically replayed.
Case 2: ω = 1 - If ω equals 1, there is one additional message, mψ+1. When process pi replays its last state interval in the consistent global recovery line, it knows that it must send two messages, mψ and mψ+1, to process pj before it delivers any messages in recovery time, because ευ_pi(pj) = 2. Thus, Υ guarantees that all non-deterministic events in the consistent global recovery line are deterministically replayed, even though the Ω Line Problem is present.
Case 3: ω = k (k ≥ 2) - If ω equals k ≥ 2, there are k + 1 messages, from mψ to mψ+k. When process pi replays its last state interval in the consistent global recovery line, it knows that it must send the k + 1 messages from mψ to mψ+k to process pj before it delivers any messages in recovery time, because ευ_pi(pj) = k + 1. Thus, Υ guarantees that all non-deterministic events in the consistent global recovery line are deterministically replayed, even though the Ω Line Problem is present.
Consequently, Υ always guarantees that all non-deterministic events in a consistent global recovery line are deterministically replayed, and it solves the Ω Line Problem in recovery time. ∎
5 Conclusion
In this paper, we show that some non-deterministic events in a consistent global recovery line cannot be deterministically replayed in recovery time, a phenomenon we call the Ω Line Problem. In addition, we propose a new optimistic log-based rollback recovery protocol and prove that it always guarantees that all non-deterministic events in a consistent global recovery line are deterministically replayed in recovery time, thereby solving the Ω Line Problem.
In future work, we will show that the Ω Line Problem also occurs in pessimistic log-based rollback recovery protocols, and we will implement the proposed optimistic log-based rollback recovery protocol in a widely distributed system, the KOREA@Home Project.
References
[1] E. N. Elnozahy, D. B. Johnson and Y. M. Wang, “A Survey of Rollback-Recovery Protocols in Message Passing
Systems,” CMU Technical Report CMU-CS-99-148, June 1999.
[2] MaengSoon Baik, JinGon Shon, Kibom Kim, JinHo Ahn, ChongSun Hwang, “An Efficient Coordinated Checkpointing
Scheme Based on PWD Model,” Lecture Notes in Computer Science, vol. 2344, pp. 780-790, 2002.
[3] E. N. Elnozahy and W. Zwaenepoel, ”On the use and implementation of message logging,” In Proceedings of the 24-th
International Symposium on Fault-Tolerant Computing, pp. 298-307, Jun. 1994.
[4] R. E. Strom and S. Yemini, “Optimistic recovery in distributed systems,” ACM Trans. on Computer Systems, vol. 3,
No. 3, pp. 204-226, Aug. 1985.
[5] L. Alvisi, “Understanding the message logging paradigm for masking process crashes,” Ph.D. Thesis, Department of
Computer Science, Cornell University, Jan. 1996.
[6] L. Alvisi and K. Marzullo, ”Message Logging: Pessimistic, Optimistic, Causal and Optimal,” IEEE Trans. on Software
Engineering, vol. 24, pp. 149-159, Feb. 1998.
[7] S. Rao, L. Alvisi and H. M. Vin, ”The cost of recovery in message logging protocols,” In Proceedings of the 17-th IEEE
Symposium on Reliable Distributed Systems(SRDS), pp. 10-18, 1998.
[8] D. B. Johnson and W. Zwaenepoel, ”Recovery in distributed systems using optimistic message logging and checkpoint-
ing,” Journal of Algorithms, vol. 11, pp. 462-491, Sept. 1990.
[9] E. N. Elnozahy, D. B. Johnson and W. Zwaenepoel, ”The Performance of Consistent Checkpointing,” Proc. 11th Symp.
on Reliable Distributed Systems, pp. 39-47, Oct. 1992.
[10] D. B. Johnson and W. Zwaenepoel, “Sender-based message logging,” In Proceedings of the 17th International Symposium
on Fault-Tolerant Computing, pp. 14-19, June 1987.
[11] D. B. Johnson, ”Distributed system fault tolerance using message logging and checkpointing,” Ph. D. Thesis, Rice
University, Dec. 1989.
[12] K. M. Chandy and L. Lamport, ”Distributed snapshots: Determining global states of distributed systems,” ACM
Trans. on Computer Systems, vol. 3, No. 1, pp. 63-75, Feb. 1985.
[13] R. Koo and S. Toueg, ”Checkpointing and rollback-recovery for distributed systems,” IEEE Trans. on Software Engi-
neering, vol. SE-13, No. 1, pp. 23-31, Jan. 1987.
[14] G. Cao and M. Singhal, ”On Coordinated Checkpointing in Distributed Systems,” IEEE Trans. on Parallel and Dis-
tributed Systems, vol. 9, No. 12, pp. 1213-1225, Dec. 1998.
[15] H. J. Michel, R. H. Netzer and M. Raynal, ”Consistency Issues in Distributed Checkpoints,” IEEE Trans. on Software
Engineering, vol. 25, No. 2, pp. 274-281, March/April. 1999.
[16] Nitin H. Vaidya, ”Staggered Consistent Checkpointing,” IEEE Trans. on Parallel and Distributed Systems, vol. 10, No.
7, pp. 694-702, July. 1999.
[17] J. Fowler and W. Zwaenepoel, ”Causal Distributed Breakpoints,” Proc. Int’l Conf. Distributed Computing Systems,
pp. 134-141, May. 1990.
[18] Y. M. Wang, Y. Huang and W. K. Fuchs, ”Progressive retry for software failure recovery in message passing applica-
tions,” IEEE Trans. on Computers, Vol. 46(10), pp 1137-1141, Oct. 1997.
A Lightweight P2P Extension to ICP
Ping-Jer Yeh1, Yu-Chen Chuang2, and Shyan-Ming Yuan1
1 Department of Computer and Information Science, National Chiao Tung University,
Hsinchu, 300 Taiwan pjyeh, [email protected]
2 FAB2 FA Department, Macronix International Co., Ltd,
Hsinchu 300, Taiwan [email protected]
Abstract. Traditional Web cache servers based on HTTP and ICP infrastructure
tend to have higher hardware and management cost, have difficulty in failover
service, lack automatic and dynamic reconfiguration, and may have slow links
to some users. We find that the peer-to-peer architecture can help solve these
problems. The peer cache service (PCS) leverages each peer’s local cache,
similar access patterns, fully distributed coordination, and fast communication
channels to enhance response time, scale of cacheable objects, and availability.
Incorporating strategies such as making the protocol lightweight and backward
compatible with existing cache infrastructure, undertaking dynamic three-level
caching, exchanging cache information, and supporting mobile devices further
improves the effectiveness.
Keywords: peer-to-peer, ICP, cache, mobile devices, distributed.
1 Introduction
There are several problems with traditional Web cache service based on HTTP
(Hypertext Transfer Protocol) and ICP (Internet Cache Protocol) infrastructure. This
section identifies what we are trying to solve, and then outlines the goals we are trying
to achieve.
1.1 Problems of Traditional Web Cache Service
Traditional Web cache service follows a client-server architecture. The association
between a Web client and a number of potential cache servers can be mandatory (i.e.,
a single pre-defined proxy server handles all URL requests) or configurable by proxy
auto-config mechanism [1] (i.e., different proxies handle different regions of URL
requests). On the other hand, the group of cache servers can be organized in two
cooperative ways: parent-child hierarchy or sibling mesh [2], and they communicate
by inter-cache protocols such as ICP [3][4].
There are several problems with this architecture, however. First, to improve
locality, an organization tends to set up a child cache server locally under the
assumption that users within the same organization have more chance to generate
recurrent usage patterns. But a problem then arises: cache servers tend to be the
bottleneck in both CPU load and storage space, bringing in higher hardware and
management cost for small offices or organizations.
The second problem is that cache servers tend to be the single point of failure. The
issue of failover can be investigated from two angles: the server part and the client
part. From the server’s standpoint, cluster technology could help, but it requires
additional system-level support and management skills. From the client’s standpoint,
a more dynamic proxy selection scheme could help. With the current proxy auto-config
mechanism, however, the list of proxy candidates that a Web client can consult is
fixed in advance unless the setting is explicitly re-configured and reloaded again and
again. Therefore, it cannot provide failover gracefully.
The lack of a dynamic proxy selection scheme also raises another problem in a
mobile environment. When a mobile device (such as notebooks, PDAs, and cell
phones) roams to a different place, it has to re-configure its proxy setting accordingly.
Such a priori knowledge may be unknown, however, since neither HTTP nor ICP can
dynamically query for proxies, like DHCP does for DNS servers.
The last but not least problem is that the connection to accessible cache servers
may be too slow. A prevailing approach is to set up a new proxy within the
organization, but this reintroduces the first two problems mentioned above.
1.2 Inspiration and Design Goals
Peer-to-peer (P2P) technology has potential for solving these problems. The basic
idea is to leverage local cache in each Web client to form a common cache space
suitable for small offices or organization scale. In this paradigm, each Web client acts
as a peer that shares its local storage with other peers, forming a larger cache space. In
this way, hardware cost is reduced since both CPU load and storage space are fully
distributed among peers. Management cost for computers and mobile devices is also
reduced since no a priori knowledge about topology of cache service is required.
Availability can be improved since the responsibility for cache service is also fully
distributed. Connection of cache service can also be improved if most peers are in the
same subnet.
P2P introduces new challenges, however. First, no central coordination server
exists in a fully distributed P2P environment. Everything is spontaneous, and
collaboration among peers is harder. If done poorly, traffic on the subnet may increase
and performance may decrease. Second, the effectiveness of cache service relies
heavily on cache information stored in peers. The total number of cacheable objects
may drop if peers become too homogeneous or isolated. Third, the topology and
boundary of peers may be unpredictable, and so is the query time for cacheable objects.
If the range of peers becomes too large, the overall performance may be worse than
that of traditional servers. Fourth, there has been much investment in traditional cache systems.
How can the new P2P cache system coexist or even cooperate with traditional ones?
Existing P2P cache projects, with their various focuses and goals, have not addressed all
these issues. To name a few, Squirrel [5] focuses on locating and routing, and Tuxedo [6]
focuses on transcoding and edge services, but both are incompatible with existing ICP
infrastructure. BuddyWeb [7] requires a central server (called the LIGLO server) and an
agent-based infrastructure for registration and tracking. Other systems, such as
FlashSurfer [8], seem to ignore these issues altogether.
Therefore, additional requirements have to be fulfilled in order to adopt the P2P
approach to Web cache service. First, the protocol should be lightweight in both
packet size and communication frequency. Second, cache information in each peer
should be exchanged effectively so as to increase the number of cacheable objects and
hit ratio. Third, the peers’ boundary should be limited to a reasonable size. Fourth, the
architecture should be as compatible as possible with current Web infrastructure built
upon HTTP and ICP, so as to leverage existing resource.
The remaining paper is organized as follows. Section 2 outlines global design
strategies for caching. Section 3 discusses the system architecture and protocol.
Section 4 discusses implementation issues, and Section 5 gives a conclusion.
2 PCS Design Strategies
This section gives an overview of environment settings of the peer cache service
(PCS), and then outlines global strategic decisions for caching, topology, and protocol
design.
2.1 Roles and Environment Setting
Our PCS is designed for small offices or organization scale. In such an environment,
each computer acts as a PCS peer, sharing its own local storage with others to form a
common cache space. The roles of peers can be classified according to their relative
location. Take Fig. 1 for example: if we are using computer A, then
• The local peer (LP) is A.
• Neighbor peers (NPs) are peers within the same subnet, i.e., B and C.
Communication between the LP and NPs is therefore fast since they reside in the
same broadcast boundary.
• Friend peers (FPs) are peers outside of the subnet.
It should be noticed that the division of roles is not mandatory but relative. If the focal
peer under consideration is B, for example, then B is the LP, and A and C are both B’s
NPs. Note also that traditional ICP-based cache servers can co-exist in the PCS
environment, too.
In each peer there runs a PCS server, which is responsible for PCS management
and communication issues. A PCS server can serve multiple browsers and users in the
same computer.
2.2 Lightweight Protocol
In contrast to traditional Web cache service, the new PCS environment needs a protocol
to query cacheable objects and to exchange cache information among peers.
The protocol should be lightweight so as to reduce traffic on the subnet. It should also
be as compatible as possible with traditional cache protocols so as to leverage existing
cache infrastructure.
It will be useful to make a distinction between two kinds of communication: the
communication between Web browsers and the PCS server in the same computer, and
the communication among different peers in different computers. The former should
remain HTTP so as to eliminate the need for modifying existing Web browsers. As for
the latter, we design the ICPoP (ICP over P2P) as a backward compatible extension to
ICP for several reasons. The first reason is that ICP is the most popular inter-cache
protocol. Maintaining backward compatibility allows us to leverage a large number of
resources. The second reason is that it is relatively lightweight compared with other
inter-cache protocols. An ICP message has a 20-byte header and a variable-sized payload.
The payload is usually a URL, whose average length is 55 bytes. It is therefore more
lightweight than other protocols (e.g., the length of an HTCP message excluding the
payload is 540 bytes [9]). In addition, ICP messages are sent over UDP, requiring less
resource.
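The size argument above can be made concrete by packing an ICP_QUERY the way RFC 2186 lays out the fields: a 20-byte fixed header, then a 4-byte requester host address and the null-terminated URL as payload. This is an illustrative sketch; the example URL is arbitrary.

```python
import struct

ICP_OP_QUERY = 1
ICP_VERSION = 2

def pack_icp_query(request_number, url, requester_addr=0, sender_addr=0):
    """Pack an ICP_QUERY datagram per RFC 2186: a 20-byte header
    (opcode, version, message length, request number, options, option
    data, sender host address), then the payload."""
    payload = struct.pack("!I", requester_addr) + url.encode("ascii") + b"\0"
    length = 20 + len(payload)                 # total length, header included
    header = struct.pack(
        "!BBHIII",
        ICP_OP_QUERY, ICP_VERSION, length,
        request_number,
        0,                                     # options
        0,                                     # option data
    ) + struct.pack("!I", sender_addr)         # sender host address
    return header + payload

msg = pack_icp_query(42, "http://www.nctu.edu.tw/")
# The whole datagram stays small: 20-byte header + 4 + len(URL) + 1 bytes.
```

With a 55-byte average URL, a query is roughly 80 bytes on the wire, which is why ICP over UDP stays cheap compared with, say, HTCP's 540-byte headers.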
2.3 Support for Mobile Devices
Fig. 1. A typical PCS environment (the LP and its NPs share a subnet; FPs and a traditional ICP server sit in other subnets)

The support for mobile devices in PCS is twofold. First, mobile devices should be
able to operate free from configuration, since everything is spontaneously configured
in the PCS environment. It is this plug-and-play operation that facilitates roaming.
Second, the communication should be infrequent so as to reduce power consumption.
As a consequence, we do not adopt complex routing strategies as Squirrel [5] and
Tuxedo [6] have done. To eliminate the need to discover peers and cacheable objects
by trial and error, ICPoP provides a broadcast mode so that a mobile device needs
only issue a single query, and then some friendly NPs will stand out spontaneously.
2.4 Dynamic Three-Level Caching
To limit the boundary of peers in a fully distributed P2P environment, we adopt a
three-level caching scheme as shown in Fig. 2. The level-0 peer is the LP itself. Level-
1 peers are all NPs within the same subnet. Level-2 peers are FPs outside the subnet
boundary. We do not consider more levels because it is usually more efficient to
connect directly to origin servers than to consult level-3 or higher peers.
When the LP wants to query for a cacheable object in the PCS environment, it
broadcasts a level-1 ICPoP query to all NPs. If an NP itself cannot find a cache hit, it
replies with a list of FP candidates that may have a cache hit, so that the LP has a
second chance to perform a level-2 caching. The list of level-2 FPs may be
recommended according to various criteria such as latency and expiration state.
Therefore, the three-level cache tree is dynamic and usually unbalanced; level-0 and
level-1 nodes are usually fixed unless some computers ever start or shutdown in the
middle, while level-2 nodes are usually changing.
2.5 Exchange for Cache Information
The effectiveness of cache service relies heavily on cache information stored in peers.
If cache information is too homogeneous or isolated, the space of cacheable objects is
too restricted. If too heterogeneous, however, locality is too low to benefit from
caching. A balance is therefore required.
Fig. 2. Dynamic three-level peer caching
Peers within the same subnet tend to have recurrent usage patterns, and they can
communicate by broadcasting. Therefore the LP and NPs do not need to exchange
cache information massively and frequently. On the other hand, the LP and FPs do
need to exchange cache information so as to increase diversity and thus the number of
known level-2 peer candidates.
We therefore design a strategy to exchange cache information by peer groups. As
Fig. 3 shows, in a subnet, each peer tries to configure itself with a group ID as
different as possible from other NPs. Then each peer with the same group ID
exchanges cache information periodically by multicasting. In this way we have a
certain degree of diversity and do not bring in too much network traffic.
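The "as different as possible" grouping rule of Fig. 3 can be sketched as a least-used choice over a small ID space. The cap of 8 group IDs reflects the setting reported later in Section 4; the function name is ours.

```python
from collections import Counter

def choose_group_id(neighbor_ids, max_groups=8):
    """Pick a group ID for a joining peer: an unused ID if one exists,
    otherwise the least-used ID among the neighbors' replies (ties broken
    by the lowest ID)."""
    counts = Counter(neighbor_ids)
    return min(range(max_groups), key=lambda gid: (counts[gid], gid))

choose_group_id([0, 0, 1])        # -> 2, the first unused ID
choose_group_id(list(range(8)))   # every ID taken once -> least-used, lowest wins: 0
```

Peers with the same resulting group ID then exchange resource-table entries by multicast, which bounds exchange traffic while still spreading cache information across subnets.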
3 Architecture and Protocol
The overall architecture of our PCS implementation is shown in Fig. 4. The rectangles
inside the box of PCS server are high-level components. The letters (a)-(h) are control
flows between components. The Roman numerals (i)-(x) are network messages.
Broken lines represent HTTP messages, while thick lines represent ICPoP messages.
This section discusses each component and protocol in detail. To save space, letters
(a)-(h) and Roman numerals (i)-(x) in this section all refer implicitly to those in Fig. 4
unless mentioned explicitly.
3.1 HTTP Manager
The HTTP manager is responsible for handling HTTP requests and replies related to
browsers in LP and PCS servers in NPs or FPs. Mobile devices are also welcome. On
receiving an HTTP request (i), it performs the following steps in sequence:
1. It asks the cache manager via (a) to see whether the requested object is in the local
cache.
Fig. 3. Grouping strategy in the PCS environment
2. Otherwise, it asks the P2P manager via (b) to consult NPs and FPs to see who has
the requested object.
3. Otherwise, it connects directly to origin servers or traditional cache servers by
HTTP.
Finally, it asks the cache manager via (a) to cache the object and then replies with the
object (ii). If the HTTP manager finds via (b) that some NPs or FPs may have the
requested object, it sends request messages (iii) to them and waits for their replies (iv).
3.2 ICPoP Manager
The ICPoP manager is responsible for handling ICPoP requests and replies related to
NPs or FPs. Three kinds of ICPoP messages are involved here: query for cacheable
objects, exchange for cache information, and query for mobility support.
Fig. 4. Architecture of a PCS server
Query for Cacheable Objects. On receiving an ICPoP request (v), the ICPoP
manager performs the following steps in sequence:
1. It asks the cache manager via (d) to see whether the requested object is in the local
cache.
2. Otherwise, it asks the resource manager via (f) to recommend a list of FP
candidates that may have the requested object.
Finally, it replies with a message (vi) containing the cache hints.
If the ICPoP manager is asked by the P2P manager via (e) to determine whether level-1
NPs have the requested object, it broadcasts request messages (vii) to the level-1 NPs
and waits for their replies (viii). Finally, it returns to the P2P manager via (e) a list of
NPs and level-2 FP candidates.
If the ICPoP manager is asked by the P2P manager via (e) to determine whether a list
of given level-2 FP candidates have the requested object, it unicasts request messages
(vii) to them and waits for their replies (viii). Finally, it returns to the P2P manager via
(e) a list of FPs having cache hits.
Exchange for Cache Information. The ICPoP manager periodically exchanges cache
information with peers having the same group ID. Initially, it broadcasts a message (vii)
to all NPs in the same subnet asking for their group IDs, and then selects from the
replies (viii) a new or least-duplicated group ID for itself.
If the ICPoP manager is about to share its cache information, it multicasts a
message (vii) to peers with the same group ID. Then it forwards the replies (viii) to
the resource manager via (f).
On receiving an ICPoP request for sharing (v), the ICPoP manager asks both the
cache manager via (d) and the resource manager via (f) to recommend a number of
cache information entries. Finally, it replies with a message (vi) containing that information.
Query for Mobility Support. If the LP is a mobile device and wants to find someone
kindly enough to provide a cache service when roaming to a new place, the ICPoP
manager broadcasts a query message (vii) to all NPs. Hopefully there are some nice
peers showing their generosity in the reply messages (viii). Then the LP can send
HTTP messages (i) to one of these nice peers.
On receiving an ICPoP request for mobility support (v), the ICPoP manager may
reply with a positive or negative answer (vi).
3.3 P2P Manager
The roles of the P2P manager are twofold. It tries to fulfill the HTTP manager’s
asking via (b) for remotely cached objects, and then asks the ICPoP manager via (e) to
perform three-level caching.
On receiving the HTTP manager’s request, the P2P manager first tries to ask the
ICPoP manager to perform level-1 caching. If no cache hit is found, it asks further to
perform level-2 caching. In this way the P2P manager maintains a dynamic (usually
unbalanced) tree of NPs and FPs.
3.4 Cache Manager
The roles of the cache manager are twofold. It tries to fulfill the HTTP manager’s
asking via (a) for retrieving or storing locally cached objects, and to fulfill the ICPoP
manager’s asking via (d) for the retrieval of them.
3.5 Resource Manager
The resource manager is responsible for maintaining dynamically exchanged cache
information. It is invoked by the ICPoP manager via (f). The format of the resource
table is listed in Table 1.
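The resource table of Table 1 amounts to a bounded map from URL to (peer address, timestamp). A minimal sketch follows; the 10,000-entry cap comes from Section 4, while oldest-first eviction is our assumption, since the paper does not name a replacement policy.

```python
from collections import OrderedDict

class ResourceTable:
    """Sketch of the resource table: URL -> (peer address, time)."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self.entries = OrderedDict()              # url -> (peer_addr, time)

    def learn(self, url, peer_addr, time):
        """Record that `peer_addr` held `url`, evicting old entries if full."""
        self.entries.pop(url, None)               # refresh insertion order
        self.entries[url] = (peer_addr, time)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)      # evict the oldest entry

    def recommend(self, url):
        """FP candidates for a miss (an ICP_MISS payload carries up to 10)."""
        entry = self.entries.get(url)
        return [entry[0]] if entry else []

table = ResourceTable(max_entries=2)
table.learn("http://www.nctu.edu.tw/img/txt.jpg", "140.113.23.111", "20020328 140000")
table.learn("http://www.nctu.edu.tw/img/openhouse.jpg", "140.113.23.204", "20020329 150252")
```

Entries learned via ICP_RT_REPLY exchanges would flow into `learn`, and the ICPoP manager's FP recommendations would come from `recommend`.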
3.6 ICPoP
The ICPoP (ICP over P2P) is a backward compatible extension to ICP. The opcodes
supported are listed in Table 2. Briefly speaking, the extension is twofold. Some
opcodes have payloads carrying additional data. In addition, some opcodes are added
to perform three-level caching and to exchange cache information.
Table 1. An example of entries in the resource table
No Resource Peer’s address Time
1 http://www.nctu.edu.tw/img/txt.jpg 140.113.23.111 20020328 140000
2 http://www.nctu.edu.tw/img/openhouse.jpg 140.113.23.204 20020329 150252
Table 2. Opcodes supported in ICPoP. Note that the names
retain the “ICP_” prefix for compatibility purposes
Type Opcode name Value Description Comment
Query ICP_QUERY 1 Query for cacheable objects Unmodified
Reply ICP_HIT 2 Cache exists Extended
Reply ICP_MISS 3 Cache does not exist Extended
Query ICP_GROUP_QUERY 25 Query for group ID New
Query ICP_RT_QUERY 26 Query for resource table New
Query ICP_RELAY_QUERY 27 Query for mobility support New
Reply ICP_GROUP_REPLY 28 Reply for group ID New
Reply ICP_RT_REPLY 29 Reply for resource table New
Reply ICP_RELAY_REPLY 30 Reply for mobility support New
ICP_QUERY (1): Query for cacheable objects. It is broadcasted to level-1 NPs or
unicasted to level-2 FPs by the ICPoP manager (vii). Possible replies (viii) are:
• ICP_HIT (2): the payload may carry additional cache information for exchange.
• ICP_MISS (3): the payload may carry a list of level-2 FP candidates.
ICP_GROUP_QUERY (25): Query for NP’s group ID. It is broadcasted to NPs in
the same subnet by the ICPoP manager (vii). The reply (viii) is:
• ICP_GROUP_REPLY (28): the payload carries another NP’s group ID.
ICP_RT_QUERY (26): Query for entries in the resource table. It is multicasted to
the peers with the same group ID by the ICPoP manager (vii). The reply (viii) is:
• ICP_RT_REPLY (29): the payload carries a number of cache information entries held by
another peer.
ICP_RELAY_QUERY (27): Query for mobility support. It is broadcasted to the
NPs in the same subnet by the ICPoP manager (vii). The reply (viii) is:
• ICP_RELAY_REPLY (30): the payload carries a positive or negative answer.
4 Implementation Issues and Conclusion
We implemented a pilot PCS server on the basis of the Java HTTP Proxy Server [10],
which is simple enough to experiment with our ideas. All major components in the PCS
server are implemented as worker thread pools.
The performance of PCS is sensitive to various factors. In our early experience, the
following settings perform well in a small office:
• We find that a PCS caching scheme of level 3 or higher is unnecessary, if not
harmful. If there seems to be a need for higher levels, a traditional cache server on a
backbone is usually a better choice.
• The maximum number of group IDs is 8.
• An ICP_HIT message may carry as many as 10 entries of cache information from
the cache manager and the resource manager, respectively.
• An ICP_MISS message may carry as many as 10 entries of level-2 FP candidates.
• An ICP_RT_REPLY message may carry as many as 50 entries of cache
information from the cache manager and the resource manager, respectively.
• The resource table may carry as many as 10,000 entries.
These settings are not necessarily optimal, however. A thorough evaluation is
needed to discover an empirically satisfactory configuration.
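The settings listed above can be collected into a single configuration object; the field names below are hypothetical, while the values are taken from the list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PCSConfig:
    """Illustrative bundle of the PCS tuning parameters from the text."""
    max_cache_levels: int = 2              # level 3 or higher judged unnecessary
    max_group_ids: int = 8
    hit_entries_per_manager: int = 10      # ICP_HIT payload limit per manager
    miss_fp_candidates: int = 10           # ICP_MISS level-2 FP candidate limit
    rt_reply_entries_per_manager: int = 50 # ICP_RT_REPLY payload limit
    resource_table_max_entries: int = 10_000
```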
5 Conclusion
This paper has presented a new approach to Web cache service. The peer cache
service (PCS) addresses problems of the traditional client-server caching
architecture, such as hardware and management cost, mobility, and speed.
Strategies such as keeping the protocol lightweight and backward compatible with
the existing cache infrastructure, undertaking dynamic three-level caching,
exchanging cache information, and supporting mobile devices further improve its
effectiveness.
The peer cache service is designed for users in small offices or organizations.
We are investigating empirical issues in order to enhance its performance and
scalability.
Acknowledgments
This work was partially supported by Ministry of Education grant 89-E-FA04-1-4:
high confidence information systems, and by National Science Council grant NSC91-
2213-E009-044: system software on handheld devices.
References
1. Netscape Communications Corporation, “Navigator Proxy Auto-Config File Format”,
Technical Report. URL: http://wp.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html (1996)
2. S. Dykes and K. Robbins, “A Viability Analysis of Cooperative Proxy Caching”,
Proceedings of the 21st Annual Joint Conference of the IEEE Computer and
Communications Societies (INFOCOM 2001), vol. 3, pp. 1205-1214, Anchorage, Alaska,
USA, April 2001.
3. D. Wessels and K. Claffy, “Internet Cache Protocol (ICP), Version 2”, RFC 2186, Network
Working Group, The Internet Society, September 1997.
4. D. Wessels and K. Claffy, “Application of Internet Cache Protocol (ICP), Version 2”, RFC
2187, Network Working Group, The Internet Society, September 1997.
5. S. Iyer, A. Rowstron, and P. Druschel, “Squirrel: A Decentralized Peer-to-Peer Web
Cache”, Proceedings of the 21st ACM Symposium on Principles of Distributed Computing
(PODC 2002), pp. 213-222, Monterey, California, July 2002.
6. W. Shi, K. Shah, Y. Mao, and V. Chaudhary, “Tuxedo: A Peer-to-Peer Caching System”,
Proceedings of 2003 International Conference on Parallel and Distributed Processing
Techniques and Applications (PDPTA’03), Las Vegas, Nevada, June 2003.
7. X. Wang, W. Ng, B. Ooi, K-L. Tan, and A. Zhou, “BuddyWeb: A P2P-based Collaborative
Web Caching System”, Peer to Peer Computing Workshop (Networking 2002), Pisa, Italy,
May 2002.
8. APA Network Technologies Ltd, FlashSurfer. URL: http://www.aranetwork.com/prod_flashsurfer.html
9. P. Vixie and D. Wessels, “Hyper Text Caching Protocol (HTCP/0.0),” RFC 2756, Network
Working Group, The Internet Society, 2000.
10. S. Hsueh, Java HTTP Proxy Server, Version 1.0. URL: http://www.geocities.com/SiliconValley/Bay/8925/jproxy.html (1998)
Comparative Analysis of the Prototype and the Simulator of a New Load
Balancing Algorithm for Heterogeneous Computing Environments
RODRIGO F. DE MELLO 1, LUIS C. TREVELIN 2, MARIA STELA V. DE PAIVA 1
1 Engineering School of São Carlos, Department of Electrical Engineering, University of São Paulo,
R. Carlos Botelho, São Carlos, Brazil
[email protected], [email protected]
2 Computing Department, Federal University of São Carlos,
Rod. Washington Luís, Km 235, São Carlos, Brazil
Abstract - The availability of low-cost hardware has allowed the development of distributed
systems. The allocation of resources in these systems may be optimized through the use of a load
balancing algorithm. Load balancing algorithms make the occupation of an environment
homogeneous, with the aim of improving final performance. This paper presents and
analyzes a new load balancing algorithm based on a tree-shaped hierarchy of computers.
The analysis is made through the results obtained by a simulator and a prototype of this algorithm.
The simulations show a significant performance gain, lowering the response times
and the number of messages that pass through the communication medium to perform the load
balancing operations. After the good simulation results, a prototype was built, and it
confirmed them.
Keywords - load balancing algorithm, heterogeneous environments, high
performance, simulation, prototype.
I. Introduction
The availability of low-cost microprocessors and the evolution of computer
networks have made the construction of distributed systems feasible. On such
systems, the processes are executed on the network computers, which
communicate with one another to perform the same computing operation. In order
to distribute the processes among these systems, a load balancing algorithm may
be adopted.
Load balancing algorithms try to distribute the process load equally among the
computers that are part of the environment. Krueger and Livny [1] demonstrate
that load balancing methods reduce the average and standard deviation of
process response times. Shorter response times are desirable, as they indicate
higher performance in process execution.
Load balancing algorithms involve four policies: transference, selection,
location and information [2, 3]. The transference policy classifies a computer as
an emitter or receptor of processes based on its availability status. The
selection policy defines the process that should be transferred from the busiest
computer to the idlest one. The location policy defines which computer should
receive or send a transferred process: a serving computer offers processes
when it is overloaded, while a receiving computer requests processes when it is idle.
The information policy defines when and how the information on the computers'
availability is updated in the system.
Several load balancing methods have been proposed [2, 4, 5]. According to
Shivaratri et al. [2], these algorithms can be subdivided into: server-initiated,
receiver-initiated, symmetrically initiated, stable server-initiated and stable
symmetrically initiated. Of these algorithms, the one that shows the highest
performance is the stable symmetrically initiated.
Despite recent work in the load balancing area [11-22], the studies and
algorithms of Krueger and Livny [1], Ferrari and Zhou [4], Theimer and Lantz
[5] and Shivaratri, Krueger and Singhal [2] are still considered the main works
in the area. The recent works do not bring significant contributions to process
performance or to the reduction of the number of messages on the network.
The work presented in this paper, like the recent works in the area, is
based on these main studies of the load balancing area.
Later on, Mello, Paiva and Trevelin, during other studies on clusters and
distributed systems [6, 7, 8, 9], observed and proposed a new method for load
balancing. Through these studies, Mello, Paiva and Trevelin [10] concluded that,
by choosing a load index based on the process behavior and on the capacity of the
computers in the environment, it would be possible to obtain better results for the
server-initiated algorithms. In these studies, the authors show a comparative
study between the Lowest algorithm, modified to use a new load index based on
the process behavior and on the computers' capacity, and the Stable Symmetrically
Initiated algorithm. According to the obtained results, the modified Lowest
algorithm shows the highest performance. These analyses have been based on the
studies of Shivaratri et al. [2], and confirm that an adequate choice of load index
modifies the behavior of the load balancing algorithm.
Vital aspects for the success of a load balancing algorithm are: stability in the
number of messages generated in the environment, in order not to overload it;
support for environments composed of heterogeneous computers and processes;
low-cost updates of the environment load information; low-cost choice of the
ideal computer to execute a process received by the system; stability under high
load; system scalability; and short response times. The study and analysis of each
of the aspects mentioned above generated a new load balancing algorithm named
TLBA (Tree Load Balancing Algorithm).
This paper presents the TLBA, a new load balancing algorithm intended for
environments where the following are desirable: high scalability; support for
heterogeneous processes and computers; a decrease in the number of messages that
pass through the communication medium; stability under high loads; and short
response times. A comparative study between the TLBA and the modified Lowest
algorithm proposed by Mello, Paiva and Trevelin [10] was carried out using
simulators. The TLBA is compared to the modified Lowest algorithm because the
latter presented better results than the Stable Symmetrically Initiated one.
After the performance gains of the TLBA were shown through simulation, a
prototype was implemented. This prototype was submitted to tests and the results
confirmed the values obtained through the simulations.
The paper is divided into the following sections: section II presents the
state of the art; section III presents the Tree Load Balancing Algorithm; section IV
presents the comparative studies based on the simulations; section V presents the
TLBA prototype and compares it to the results obtained in the simulations;
section VI presents the conclusions; and, finally, the references.
II. The State of the Art
Among the main studies on load balancing, we may highlight the ones by Zhou
and Ferrari [4], Theimer and Lantz [5], Shivaratri, Krueger and Singhal [2] and
Mello, Paiva and Trevelin [10]. Other recent works have been proposed [11-22];
however, they do not offer significant performance contributions. These recent
works are also based on the studies of Zhou and Ferrari [4], Theimer and Lantz
[5] and Shivaratri, Krueger and Singhal [2].
Zhou and Ferrari [4] evaluate five server-initiated load balancing algorithms,
i.e. algorithms initiated by the most overloaded computer. These algorithms are:
Disted, Global, Central, Random and Lowest. In Disted, when a computer notices
any alteration in its load, it emits messages to the other computers to inform
them about its current load. In Global, one computer centralizes the load
information of all computers in the environment and sends broadcasts to keep the
others updated about the situation. In Central, as in Global, a central computer
receives all the load information of the system; however, it does not update the
other computers about it. This centralizing computer decides on the resource
allocation in the environment. In Random, no information about the environment
load is handled; a computer is selected at random to receive a process to be
initiated. In Lowest, the load information is sent on demand. When a computer
starts a process, it requests and analyzes the loads of a small set of computers
and submits the process to the idlest one, i.e. the one with the shortest
process queue.
When Zhou and Ferrari analyzed the Disted, Global, Central, Random and Lowest
algorithms, they reached the conclusion that all of them show higher performance
than a system with no load balancing at all. However, the Lowest algorithm showed
the highest performance of all. Because it generates messages on demand to
transfer processes, it imposes a lower overload on the communication medium and
thus allows the incremental growth of the environment. The number of messages
generated in the environment does not depend on the number of computers.
Theimer and Lantz [5] implemented algorithms similar to Central, Disted and
Lowest. Nevertheless, they analyzed such algorithms on systems composed of a
larger number of computers (about 70). For Disted and Lowest, process reception
and emission groups were created. The communication within these groups was made
using a multicast protocol, in order to minimize the message exchange among the
computers. Computers with a load below a given limit participated in the process
reception group, whilst computers with a load above the limit participated in the
process emission group.
Theimer and Lantz recommend decentralized algorithms such as Lowest and
Disted, as they do not create single points of failure as Central does. Central
presents the highest performance for small and medium-sized networks, but it
degrades in environments composed of too many computers.
Through their analyses, Theimer and Lantz concluded that algorithms such as
Lowest work with the probability of a computer being idle [5]. Thus, they assume
system homogeneity, as they use the length of the CPU waiting queue as the
load index. The process behavior is not analyzed; therefore, the actual load of
each computer is not measured.
Shivaratri, Krueger and Singhal [2] analyze the server-initiated,
receiver-initiated, symmetrically initiated, adaptive symmetrically initiated and
stable symmetrically initiated algorithms. In these studies, the length of the
process waiting queue at the CPU was used as the load index. This measure was
chosen because it is simple and, therefore, consumes few resources to obtain.
Shivaratri, Krueger and Singhal concluded that the receiver-initiated
algorithms present higher performance than the server-initiated ones, such as
Lowest. However, the algorithm that showed the highest final performance was the
stable symmetrically initiated one. This algorithm preserves the history of the
load information exchanged in the system and uses this historical information
when deciding to transfer processes.
Mello, Paiva and Trevelin [10] analyze a load index based on the process behavior
as well as on the capacity of the system's computers. The authors propose a
comparison between the Stable Symmetrically Initiated algorithm of Shivaratri
et al. [2] and the modified Lowest algorithm, which uses an analysis of the
process behavior and the computers' capacity as the load index. Through these
studies, it was concluded that, by using this new load index, the Lowest
algorithm provides better response times. These results were reached through the
simulation of homogeneous and heterogeneous environments.
III. The Tree Load Balancing Algorithm: A New Model
The TLBA (Tree Load Balancing Algorithm) creates a virtual interconnecting tree
among the computers of the system. In this tree, each computer at level N sends
its updated load information to the level N-1 computers. The lower the level, the
closer it is to the root; the root is located at level zero.
The selection of the best computer to execute a process received by the system
works as a depth-first search on the interconnecting tree. Thus, a level-N
computer may define which of its branches at level N+1 presents the lowest
occupation, passing on a search message that is propagated through the branches
of the tree until it finds the idlest computer.
The tree may be defined as a non-cyclic connected graph. Let G = (X, U) be a
graph with n > 2 vertices; the following properties are equivalent and
characterize G as a tree:
1) G is connected and non-cyclic;
2) G is non-cyclic and has n-1 edges;
3) G is connected and has n-1 edges;
4) G is non-cyclic and, by adding one edge, exactly one cycle is created;
5) G is connected, but stops being so if any edge is suppressed;
6) every pair of G's vertices is connected by exactly one simple chain.
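Property (3) above gives a cheap programmatic test: a graph on n vertices is a tree if and only if it is connected and has n-1 edges. A minimal sketch using union-find (the function name and the 0..n-1 vertex labeling are our assumptions):

```python
def is_tree(n, edges):
    """Check property (3): G is a tree iff it has n-1 edges and no cycle
    (which, with n-1 edges, also forces connectivity)."""
    if len(edges) != n - 1:
        return False
    parent = list(range(n))
    def find(x):
        # Find the set representative, with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:          # this edge would close a cycle
            return False
        parent[ru] = rv       # union the two components
    return True
```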
After defining the tree, it is necessary to define the levels that make it up.
Let G = (X, R) be an oriented graph without circuits; a partition N of X may be
defined as

N = {N0, N1, ..., Nr}

where the elements N0, N1, ..., Nr, called levels, are ordered by the inclusion
of their inverse images R^-1 as follows:

N0 = {xi | R^-1(xi) = ∅}
N1 = {xi | R^-1(xi) ⊆ N0}
N2 = {xi | R^-1(xi) ⊆ (N0 ∪ N1)}
...
Nr = {xi | R^-1(xi) ⊆ N0 ∪ N1 ∪ ... ∪ N(r-1)}
Therefore, the vertices of a level are only preceded by vertices of the previous
levels, which, evidently, is only possible if the graph has no circuits.
The last level satisfies R^+(Nr) = ∅, since there are no more successors in the
graph.
A) Update of the Information Used to Find the Best Computer in the System
In the TLBA, each computer of the system keeps an updated table with occupation
information. This table includes its own information as well as that of the other
computers that are part of its subtree. The subtree of a computer α, located at
level N, is given by all the computers at level N+1 that succeed α through an
oriented arc. The computer α participates in this subtree and is defined as its
father.
Figure 1 – Subtree of a computer α
The computers at level N+1 send information about their occupation to the
antecedent computer located at level N. The father computer generates a single
number that represents its occupation and that of its subtree. To make this
calculation, the father computer sums up the occupations, relative to the
capacities, of each computer in its subtree, generating a single load value for
the subtree it controls. This load value is propagated up to the root node.
A level N+1 computer only sends its load information when the change exceeds a
certain threshold percentage. Assuming a threshold of 5%, the current load is
compared to the one taken at the latest reading; if the current load is 5% above
or below the previous load, then the load information is sent to the father
computer at level N. This threshold decreases the number of messages related to
information updates in the system. By decreasing the overload on the
communication medium, stability increases [2].
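The 5% threshold rule can be sketched as follows; the function name and the relative-change interpretation of "5% above or below" are our assumptions.

```python
def should_propagate(current_load: float, last_sent_load: float,
                     threshold: float = 0.05) -> bool:
    """Send the load to the father only when the current load has drifted
    more than `threshold` (5% in the text) above or below the last value sent."""
    if last_sent_load == 0:
        return current_load != 0   # any nonzero load is news
    return abs(current_load - last_sent_load) / last_sent_load > threshold
```

Each node would call this on every local reading and transmit upward only on a True result, which is what keeps the message count stable.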
B) Load Index Calculation
The load index used by the TLBA is based on an analysis of the CPU and memory
resources consumed by the processes. The reading of each resource's occupation is
given as:

O(i, r, t) = Nread(i, r, t)

where:
i is the computer's identifier
r is the computing resource (CPU or memory)
t is the reading instant
O is the occupation of the read resource
Nread is the occupation reading function

After the occupation of each resource is read, the load index that represents
the load of the analyzed computer is calculated by the equation:

Ncarga(i, t) = O(i, CPU, t)/CAPcpu + O(i, MEMO, t)/CAPmemo
where:
i is the computer’s identifier
t is the occupation reading instant
CPU and MEMO are the resources
O(i, CPU, t) is given by the sum of the CPU’s occupation by each process of the
system at a certain interval
O(i, MEMO, t) is given by the sum of the memory occupation by each process of
the system at a certain interval
CAPcpu is the computer CPU’s capacity at a certain interval
CAPmemo is the computer memory capacity at a certain interval
This load index considers the CPU and memory occupations of each computer in
the system. The occupation of each resource is divided by its total capacity,
generating an occupation percentage. By summing the CPU and memory occupations,
a single index is generated, which is used by the whole system.
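The Ncarga equation above translates directly into code; the parameter names are ours, the formula is the paper's.

```python
def load_index(cpu_occupation: float, mem_occupation: float,
               cap_cpu: float, cap_mem: float) -> float:
    """Ncarga(i, t) = O(i,CPU,t)/CAPcpu + O(i,MEMO,t)/CAPmemo:
    each resource's occupation is normalized by its capacity, and the two
    resulting percentages are summed into a single index."""
    return cpu_occupation / cap_cpu + mem_occupation / cap_mem
```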
C) Selecting the Best Computer of the System
The selection of the best computer to execute a process received by the system is
made by using the tables contained in each computer of the tree. Each computer
has the updated information related to its subtree. This information is propagated
until it reaches the root computer of the system, located on level zero. A request to
execute the process in the system is always received by the root computer and this
computer starts a deep search through the tree until it finds the best computer of
the environment.
The depth-first search starts at the root computer, which compares its load to
those of the other computers in its subtree. Note that the root computer does not
have the load information of every computer in the system; it has only load
values that summarize the loads of its subtrees. If the root computer had the
load of every computer in the system, it could send the process directly to the
idlest node; however, this technique would be similar to the Central algorithm
[4], generating many messages and overloading the network. If the root computer
has the lowest load, it executes the process. If not, it sends a request to the
best computer of its subtree, which continues the same search until the idlest
computer in the system is found.
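The descent described above can be sketched as a recursive search. This is a simplifying sketch: the class and function names are ours, and the comparison rule (a node's own load versus each child's single subtree summary) is an assumption, since the paper does not spell out how the aggregated value is compared.

```python
class Node:
    def __init__(self, load, children=()):
        self.load = load                  # this computer's own load index
        self.children = list(children)
        # single value summarizing the whole subtree, as kept in the tables
        self.subtree_load = load + sum(c.subtree_load for c in self.children)

def select_idlest(node):
    """Depth-first descent from the root: at each step, either keep the
    current computer or descend into the child whose subtree summary is
    lowest, until the idlest computer is reached."""
    best_child = min(node.children, key=lambda c: c.subtree_load, default=None)
    if best_child is None or node.load <= best_child.subtree_load:
        return node
    return select_idlest(best_child)
```

In the real algorithm each descent step is a message to the chosen child, so the cost is bounded by the tree depth rather than by the number of computers.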
D) Migration and Checkpointing of the Initiated Processes
The TLBA provides support for the checkpointing and migration of processes
initiated in the system. When a computer updates its local load information, it
also analyzes the percentage occupation of the system. If the system load is
above a pre-determined limit, it generates an analysis message for the migration
of processes that will follow. This message is sent to the root computer of the
system, which calculates the average occupation of the whole system.
The average occupation of the whole system is calculated by dividing the total
load of the system by the number of computers. The root computer is capable of
making this calculation as it has the updated load information of the whole
system. The value calculated by the root computer is sent back to the computer
that sent the analysis message for the migration that will follow.
The computer that sent the analysis message for the succeeding migration
compares its load to the average load of the remaining computers in the system.
If its load is higher, it starts the checkpointing and requests the root computer
to search for the idlest computer in the system (section III.C). After the idlest
computer is found, it submits one of its processes to it. If its load is lower
than or similar to those of the remaining computers in the system, it does not
perform the checkpointing and migration of processes.
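The migration decision above reduces to a comparison against the root-computed average; the function name is ours, the rule is the paper's.

```python
def should_migrate(local_load: float, total_system_load: float,
                   n_computers: int) -> bool:
    """A computer checkpoints and migrates a process only when its load
    exceeds the system-wide average, which the root computer obtains by
    dividing the total system load by the number of computers."""
    average = total_system_load / n_computers
    return local_load > average
```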
E) Other considerations
Computers join the tree through decisions made by a Broker. This Broker is a
software component that receives requests from computers interested in joining
the tree and defines where each computer will be placed. The goal of the Broker
is to create a balanced tree according to the computers' capacities. The capacity
of each computer is measured through a benchmark program, which is run before the
computer is placed in the tree.
The computers of a higher or lower level detect the failure of a computer. The
higher levels identify failures when trying to send load information (section
III.A); the lower levels identify failures when selecting the best computer to
run a process (section III.C). Once a computer detects a failure, it sends a
message to the Broker, which rebuilds the tree structure. The Broker can run on
several computers to avoid failures.
If system administrators do not want to use the Broker, they can manually
configure the environment, setting the tree levels and computers per level.
However, manual configurations cannot take advantage of the failure treatment
offered by the Broker.
Another relevant aspect is that the system administrator defines the number of
nodes per tree level. Many nodes per level can increase the number of messages
generated between levels, overloading the network; few nodes per level hamper
the depth-first search for the best computer to run a process (section III.C),
since the tree becomes deeper.
IV. Simulations
The proposed simulations compare the behavior of the TLBA with that of the
modified Lowest algorithm [10]. The modified Lowest showed better performance
than the Stable Symmetrically Initiated algorithm, which was proposed by
Shivaratri et al. [2] as the best solution. The simulations are subdivided into
the following groups:
1) Simulations analyzing a system composed of 7 homogeneous and
heterogeneous computers.
2) Simulations analyzing a system composed of 781 homogeneous and
heterogeneous computers.
3) Simulations analyzing the number of messages that pass through the
communication network in the TLBA and the modified Lowest.
These simulation groups were proposed to permit a complete analysis of the
environment; in this way the TLBA was tested for both large and small systems.
To complete the tests, the simulations were done using homogeneous and
heterogeneous environments, so the analyses cover the main possibilities of an
environment. We used 7 and 781 computers precisely to create symmetrical trees
for the tests: for 7 computers, a tree of 3 levels was created, in which each
node has 2 children; for 781 computers, a tree of 5 levels was created, in which
each node has 5 children.
This section uses tables to present the simulation data, because they present
the data more precisely. A Poisson distribution was used to simulate the arrival
of processes to the system, the same probability distribution function used by
Zhou and Ferrari [4], Theimer and Lantz [5] and Shivaratri, Krueger and Singhal
[2].
The following sections show the simulation groups and their respective analyses.
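The sizes 7 and 781 correspond to complete trees (1+2+4 = 7 with fanout 2, and 1+5+25+125+625 = 781 with fanout 5); a short sketch (the function name is ours) reproduces the level counts:

```python
def tree_depth(n_computers: int, fanout: int) -> int:
    """Number of levels in the smallest complete tree holding n computers:
    the first depth d such that 1 + f + f^2 + ... + f^(d-1) >= n."""
    total, depth = 0, 0
    while total < n_computers:
        total += fanout ** depth
        depth += 1
    return depth
```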
A) Comparing the Lowest and TLBA response times in an environment composed of 7 computers
Table 1 presents the results obtained in the simulations conducted in an
environment composed of 7 homogeneous computers. The modified Lowest considered a
subnet of three computers to define the idlest one, which then receives a process
initiated in the system. The Poisson probability distribution function used to
define the process arrivals has a mean of 100 seconds. The capacity of each
computer in the system is 368 cycles per millisecond, and 320 processes were
submitted to the system.
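The arrival model above (Poisson arrivals with a 100-second mean, 320 processes) can be sketched by drawing exponentially distributed inter-arrival times; the function name and the fixed seed are our assumptions.

```python
import random

def arrival_times(n_processes: int, mean_interarrival_s: float = 100.0,
                  seed: int = 42) -> list[float]:
    """Exponentially distributed inter-arrival times yield a Poisson
    arrival process, as used in the simulations (mean 100 s, 320 processes)."""
    random.seed(seed)
    t, times = 0.0, []
    for _ in range(n_processes):
        t += random.expovariate(1.0 / mean_interarrival_s)
        times.append(t)
    return times
```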
TABLE 1 – Response time variation of the modified Lowest and TLBA algorithms for different process occupations

Occupation per Process (millions of cycles) | Modified Lowest response time, homogeneous environment (ms) | TLBA response time, homogeneous environment (ms)
10 | 612755.33 | 604501.43
100 | 6394020.73 | 6317613.53
1000 | 64152160.20 | 63446532.46
10000 | 640987379.63 | 634735601.75
100000 | 6418958569.10 | 6347626353.23
1000000 | 65020352364.06 | 63476532508.20
10000000 | 643614100672.71 | 634765594915.75
100000000 | 6426120894695.79 | 6347656220029.75
1000000000 | 64770720080292.50 | 63476562469883.10
10000000000 | 642306385839865.00 | 634765624969849.00
100000000000 | 6576426630405830.00 | 6347656249969880.00
1000000000000 | 64104959239102000.00 | 63476562499968600.00
From the data of Table 1, exponential regressions were obtained for each
algorithm in the simulated situation, yielding equations that express the
behavior of the algorithms. The equations of the modified Lowest and TLBA
algorithms are presented in figures 2 and 3, respectively.

y = 63,011 · e^(2.3054x)
Figure 2 – Equation of the modified Lowest algorithm for the simulation in an
environment composed of 7 homogeneous computers

y = 62,358 · e^(2.3046x)
Figure 3 – Equation of the TLBA for the simulation in an environment composed of
7 homogeneous computers

It should be mentioned that x in the exponential equations reflects the unitary
number of cycles and y reflects the response time. Thus, in order to analyze an
application load of a million cycles, x = 1,000,000 should be used.
By analyzing the exponential equations obtained through the simulations of the
TLBA and the modified Lowest, it may be concluded that the TLBA equation
presents the shorter response time. This may be noted through the analysis of the
equations' thetas. The TLBA theta1 is 62,358 whereas the modified Lowest one is
63,011; the lower the theta, the shorter the generated response time. Besides
that, theta2 is 2.3046 for the TLBA and 2.3054 for the Lowest. Therefore, the
TLBA generates a function whose response times grow more slowly.
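The regressions behind figures 2 and 3 can be reproduced by ordinary least squares on ln(y), which linearizes y = theta1 · e^(theta2 · x); a self-contained sketch (function name ours, data pairs supplied by the caller):

```python
import math

def fit_exponential(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Fit y = theta1 * exp(theta2 * x) by linear least squares on ln(y)."""
    n = len(xs)
    lys = [math.log(y) for y in ys]
    mx = sum(xs) / n
    my = sum(lys) / n
    # Slope and intercept of the regression of ln(y) on x.
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (ly - my) for x, ly in zip(xs, lys))
    theta2 = sxy / sxx
    theta1 = math.exp(my - theta2 * mx)
    return theta1, theta2
```

Applied to the columns of Table 1 (with x on the appropriate scale), this yields the theta1/theta2 pairs compared in the text.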
Table 2 shows the results obtained in the simulations conducted in an
environment composed of 7 heterogeneous computers. The simulated situation is the
same as in the homogeneous environment: the Poisson probability distribution
function related to the process arrivals has a mean of 100 seconds; the capacity
of each computer in the system is 368 cycles per millisecond; 320 processes were
submitted to the system; and a subnet of 3 computers was taken into account so
that the modified Lowest could find the idlest computer.
TABLE 2 – Modified Lowest and TLBA response times for different process occupations in a heterogeneous system

Occupation per Process (millions of cycles) | Modified Lowest response time, heterogeneous environment (ms) | TLBA response time, heterogeneous environment (ms)
10 | 874367.01 | 834312.15
100 | 8929430.36 | 8626009.67
1000 | 89918432.93 | 86541094.87
10000 | 897244022.45 | 865692066.19
100000 | 8969711464.11 | 8657202652.08
1000000 | 90185968862.68 | 86572308609.92
10000000 | 902407938340.81 | 865723367688.47
100000000 | 8981001449908.22 | 8657233958852.14
1000000000 | 89777274689289.90 | 86572339868669.00
10000000000 | 898600800041058.00 | 865723398968154.00
100000000000 | 8955454881991950.00 | 8657233989964290.00
1000000000000 | 89532098573832300.00 | 86572339899923800.00
The exponential equations that express the behavior of the modified Lowest and
TLBA algorithms are shown in figures 4 and 5, respectively.

y = 88,996 · e^(2.30x)
Figure 4 – Equation of the modified Lowest algorithm for the simulation in an
environment composed of 7 heterogeneous computers

y = 85,415 · e^(2.30x)
Figure 5 – Equation of the TLBA for the simulation in an environment composed of
7 heterogeneous computers

From the exponential equations of figures 4 and 5, obtained from the simulations
of the Lowest and TLBA algorithms respectively, it may be concluded that the
theta1 of the Lowest equation is higher, which implies longer response times.
Therefore, the TLBA also behaves better in heterogeneous environments, showing
shorter response times.
The exponential equations obtained from the simulations in environments composed
of 7 computers may be used to forecast the algorithms' behavior under different
loads. The equations may be applied to different process occupation values
(x axis), resulting in response times (y axis). In all cases, the TLBA generated
exponential equations with lower thetas, so it can be considered the best
algorithm for the presented cases.
B) Comparing the response times of Lowest and TLBA in an environment composed of 781 computers
The following simulations compare the modified Lowest to the TLBA in a
homogeneous environment composed of 781 computers. In these simulations, 35,145
processes were submitted to the environment, based on the Poisson probability
distribution function with a mean of 100 seconds. Each computer's capacity is 368
cycles/ms. For the Lowest algorithm, a subnet of eleven computers was considered;
these were selected on a random basis and their loads were analyzed to define the
idlest computer. Table 3 presents the results obtained in the simulations.
TABLE 3 – Response times related to the occupation per process of the modified Lowest and TLBA algorithms

Occupation per Process (millions of cycles) | Modified Lowest response time, homogeneous environment (ms) | TLBA response time, homogeneous environment (ms)
10000 | 621599956.08 | 621564741.84
100000 | 6246908556.10 | 6246573454.81
1000000 | 62500016636.37 | 62496667285.42
10000000 | 625031931013.69 | 624997605085.41
100000000 | 6250345430262.57 | 6250006980417.56
1000000000 | 62503491403134.80 | 62500100730700.10
10000000000 | 625034828968439.00 | 625001038228383.00
100000000000 | 6250347547420140.00 | 6250010413228300.00
1000000000000 | 62503488649116800.00 | 62500104163224200.00
By doing the exponential regressions on the data obtained and presented on table
3, the exponential equations were generated and they express the behavior of the
modified Lowest algorithm (figure 6) and TLBA (figure 7).
y = 62,337,175.42 e 2.30x
Figure 6 – Equation of modified Lowest algorithm on the simulation at an environment composed
of 781 homogeneous computers
y = 62,333,740.06 e^(2.30x)
Figure 7 – Equation of TLBA algorithm on the simulation at an environment composed of 781
homogeneous computers
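Exponential regressions of the form y = theta0 · e^(theta1·x), as used above, reduce to an ordinary linear least-squares fit on ln(y). The sketch below uses synthetic data rather than the paper's measurements; the function name is an illustrative choice.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by linear least squares on ln(y)."""
    n = len(xs)
    lys = [math.log(y) for y in ys]
    mx = sum(xs) / n
    my = sum(lys) / n
    b = sum((x - mx) * (ly - my) for x, ly in zip(xs, lys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic check: data generated from y = 3 * exp(0.5 * x)
xs = [0, 1, 2, 3, 4]
ys = [3 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
```

Applied to exact exponential data, the fit recovers the coefficients; applied to measured response times, it yields the theta0 (multiplier) and theta1 (exponent) values compared throughout this section.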
Analyzing the resulting equations (figures 6 and 7), we may note that the theta1
for the TLBA exponential equation is lower than the one for the modified
Lowest, which implies shorter response times when using TLBA load balancing.
The same environment composed of 781 computers was simulated again, this
time with computers of heterogeneous capacities: 391 of them had a 368
cycles/ms capacity and 390 of them had a 139 cycles/ms capacity. The results
obtained in these simulations are shown in table 4.
TABLE 4 – Response times obtained at different process occupations for the modified Lowest and TLBA algorithms

Occupation per process    Modified Lowest response time,    TLBA response time,
(millions of cycles)      heterogeneous environment (ms)    heterogeneous environment (ms)
10000                     903520989.32                      903494690.55
100000                    9066120728.99                     9065872768.52
1000000                   90692197539.30                    90689656980.88
10000000                  906952556067.72                   906927490948.86
100000000                 9069556015328.95                  9069305844417.15
1000000000                90695569517117.60                 90693089372139.10
10000000000               906956131430007.00                906930924662546.00
100000000000              9069558757488640.00               9069309277553760.00
1000000000000             90695602819271600.00              90693092806468100.00
From the data shown in table 4, regressions were computed for the modified
Lowest algorithm and the TLBA, yielding the exponential equations that express
the behavior of both algorithms. Figures 8 and 9 show, respectively, the
equations for the modified Lowest algorithm and the TLBA.
y = 90,529,356.61 e^(2.30x)
Figure 8 – Equation of modified Lowest algorithm on the simulation at an environment composed
of 781 heterogeneous computers
y = 59,753.94 e^(2.30x)
Figure 9 – Equation of TLBA algorithm on the simulation at an environment composed of 781
heterogeneous computers
Observing the equations of figures 8 and 9, it may be noted that the theta1
related to the Lowest algorithm is much higher than that of the TLBA, which
confirms the worse response times of the modified Lowest. Although Lowest
shows response times close to those presented by the TLBA, tables 3 and 4
show that these times become much worse as the environment load increases.
C) Simulations that analyze the number of messages
generated for the TLBA and modified Lowest algorithms
Simulations were carried out to analyze the traffic of messages exchanged among
the computers when using the TLBA and modified Lowest algorithms. The
simulated environment is composed of 7 computers segmented into three levels
with two computers on each level. In the modified Lowest case, the subset of
computers analyzed before the load balancing operation is composed of 3
computers. The number of processes arriving at the system varies from 10 to
350. Process arrivals follow a Poisson probability distribution function with
average 3 seconds. Each computer's capacity is 368 cycles per millisecond.
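The arrival pattern used throughout these simulations, a Poisson process with a given average, can be generated from exponential inter-arrival gaps. This is a generic sketch of that setup, not the authors' simulator; the function name and the fixed seed are illustrative choices.

```python
import random

def arrival_times(n, mean_interarrival=3.0, seed=42):
    """Generate n arrival instants of a Poisson arrival process:
    inter-arrival gaps are exponentially distributed with the
    given mean (in seconds)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(1.0 / mean_interarrival)
        times.append(t)
    return times
```

Fixing the seed makes a simulation run reproducible, which is what allows response times and message counts to be compared across algorithm variants under identical workloads.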
The graph in figure 10 shows the number of messages exchanged among the
computers for the Lowest algorithm and among the TLBA levels.
FIGURE 10 – Number of messages exchanged among the computers for the Lowest and TLBA algorithms, in an environment composed of 7 homogeneous computers (x axis: number of processes, from 10 to 350; y axis: messages per level, from 0 to 3,000; series: TLBA, Lowest, and TLBA at a single network)
Due to the TLBA tree format, it is possible to segment each level into a
physically distinct network, which causes less overload on the communication
means. Figure 10 shows three simulated situations: the first using the TLBA
with a physically distinct network for each level; the modified Lowest; and the
TLBA on a physically single network.
The TLBA generates fewer messages per level than the modified Lowest. The
growth rate of the number of messages that pass through the communication
network is much higher for Lowest. Even when using the TLBA on a single
network, there is a significant decrease in the number of messages compared to
the Lowest.
Both the TLBA and the modified Lowest show variations in the number of
messages depending on the number of executed processes. Lowest generates
messages based on the subnet of computers analyzed to initiate a load balancing
operation, whilst the TLBA generates a depth search message over the tree to
find the best computer to execute the process. The TLBA also generates
messages updating the environment occupation; these messages run from the
leaves to the root of the virtual tree.
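The two kinds of TLBA messages described above — a depth search down the tree for the idlest computer, and occupation updates flowing from the leaves to the root — can be sketched on a small virtual tree. The node layout and field names below are assumptions for illustration, not the paper's data structures.

```python
class Node:
    def __init__(self, load, children=()):
        self.load = load              # this computer's occupation
        self.children = list(children)

def deepest_idlest(node):
    """Depth search over the tree: return (load, node) of the idlest computer."""
    best = (node.load, node)
    for child in node.children:
        cand = deepest_idlest(child)
        if cand[0] < best[0]:
            best = cand
    return best

def propagate_updates(node):
    """Occupation updates run from the leaves to the root: return
    (total subtree load, number of update messages sent)."""
    total, msgs = node.load, 0
    for child in node.children:
        sub, m = propagate_updates(child)
        total += sub
        msgs += m + 1   # one child-to-parent update message per link
    return total, msgs

# A 7-computer tree with one root, two mid-level nodes, four leaves
root = Node(5, [Node(3, [Node(7), Node(2)]),
                Node(4, [Node(6), Node(1)])])
```

On this tree the update pass sends one message per parent-child link (six in total), which matches the intuition that update traffic scales with the number of edges rather than with the number of computer pairs.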
TABLE 5 – Number of messages observed at 350 processes in the environment

Modified Lowest    TLBA        TLBA at a single network
2450               283.3333    850
Analyzing table 5, we may conclude that the number of messages for the TLBA
is 8.647 times lower than for the modified Lowest. The TLBA on a single
network generates 2.88235 times fewer messages than the modified Lowest. In
both situations the TLBA causes less overload on the communication means.
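The ratios quoted above follow directly from the message counts in table 5:

```python
# Message counts from table 5 (350 processes)
lowest, tlba, tlba_single = 2450, 283.3333, 850

# TLBA with per-level networks vs. modified Lowest
ratio_tlba = lowest / tlba            # about 8.647
# TLBA on a single network vs. modified Lowest
ratio_single = lowest / tlba_single   # about 2.88235
```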
D) Simulations to analyze the number of messages that
pass through the communication network for the modified
Lowest and TLBA algorithms based on the number of
computers per level
Simulations were conducted to analyze the impact caused by varying the
number of levels and the number of computers per level that make up the TLBA
tree. The environment used in these simulations contained 781 homogeneous
computers with a capacity of 368 cycles per millisecond. 3,000 processes were
submitted to the environment based on the Poisson probability distribution
function with average 3 seconds. Three situations were simulated for the TLBA:
1) the first had two levels, with one computer at the first level and 780 at the
second one;
2) the second had five levels, with one computer at the first level and five
computers on each of the remaining ones;
3) the third had 781 levels, with one computer per level.
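If "five computers" in situation 2 is read as the tree's branching factor — an interpretation on our part, suggested by the fact that 781 = 1 + 5 + 25 + 125 + 625 — then all three configurations cover exactly the 781 computers:

```python
def tree_size(branching, levels):
    """Number of computers in a tree with the given branching factor
    and number of levels: sum of branching**i for i in 0..levels-1."""
    return sum(branching ** i for i in range(levels))

# Situation 1: two levels, 1 root + 780 children
# Situation 2: five levels with branching factor 5 (our interpretation)
# Situation 3: a chain of 781 levels, one computer each
```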
Messages per level for different numbers of computers per level:

Computers per level       780         5          1
Lowest (messages)         69000.00    69000.00   69000.00
Tree/TLBA (messages)      5996.00     182.36     566133.00

FIGURE 11 – Messages exchanged among computers, segmented by level
Figure 11 shows the results obtained for the TLBA and the modified Lowest
algorithm. We may note that Lowest generates a constant number of messages,
equal to 69,000, whilst the number of messages for the TLBA varies according
to the number of levels and computers per level: about 182.36 for 5 levels,
5,996 for 2 levels, and 566,133 for 781 levels.
Choosing an intermediate number of levels generates lower traffic on the
communication means, which increases the system scalability. Using one
computer per level yields no interesting results, because the communication
among the computers becomes overloaded, which decreases the final efficiency
of the system and makes it not scalable. The overload happens due to the
number of levels traversed to find the best computer in the system to execute a
received process, and to update the information that propagates toward the root
computer.
The definition of an intermediate number of levels allows the creation of
subnets, each formed by a number of computers. Segmenting the system into
levels composed of physically distinct networks generates lower traffic on the
communication means, as most of the messages are exchanged among the
computers of a given level and therefore transit only on that level's physical
network. Creating a subnet requires one switch per level. For instance, a switch
is used to interconnect the computer placed at a level N to the ones of the level
N-1. A computer of the level N-1 needs to have two network interface cards,
one to connect to level N through the switch and another to connect to level
N-2. However, to create different subnets, the system administrator must define
the tree structure as static, and therefore cannot use the broker as implemented
so far (section III.E).
V. Comparative Evaluation of the Results Obtained
on the Executions of a TLBA Prototype and
Simulator
In order to confirm the TLBA's feasibility, a prototype was implemented using
GNU/GCC, a free software compiler available for several platforms [23], with
the Linux operating system as the test environment. This section compares the
results obtained from the executions of this prototype to the simulation results.
The prototype was submitted to two environment configurations: a homogeneous
one and a heterogeneous one. The homogeneous environment was composed of
seven computers with a 368 cycles/ms capacity. The heterogeneous environment
was composed of 2 computers with a 139 cycles/ms capacity and 5 computers
with a 368 cycles/ms capacity. The simulated and real environments used the
same parameters, which permits comparing both.
Table 6 presents the results obtained from the prototype executions and the
simulation results for homogeneous environments. The executions were carried
out with processes that occupy 10,330,000 cycles, in three distinct situations. In
the first case, processes that occupied only the CPU, without performing context
switches, were submitted. In the second case, processes that occupied the CPU
and made calls to the kernel's sleep function were submitted, totaling 512,500
milliseconds of sleep per process. In the third and last case, processes with an
accumulated sleep time of 1,024,000 milliseconds each were submitted. These
sleep times were defined so that the processes would take longer to complete,
which allowed the whole system to have more processes executing
simultaneously. In all cases, 350 processes were submitted to the environment
using a Poisson probability distribution function for process arrival with average
3 seconds. Two computers per level were defined to make up the
communication tree among the TLBA modules, as shown in figure 12. All
computers were connected to a 100 Mbit/s switch.
Figure 12 – Connecting tree created by TLBA
The number and occupation of the processes submitted to the environment were
defined to obtain a TLBA analysis for high-load environments. The three
analyzed cases showed the results in situations where there are load peaks;
situations where the high load is maintained for an average period of time; and
situations where the load is maintained for long periods of time.
TABLE 6 – Response times at the homogeneous environment (using the prototype and the
simulator)

Cycles                          Simulator (ms)    Prototype, actual environment (ms)    Variation
10330000                        698,125.6005      697,645                               0.00068889
10330000 + 512500 ms sleep      1,210,625.6       1,194,509                             0.013492239
10330000 + 1024000 ms sleep     1,722,125.6       1,760,153                             0.022081664
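The "Variation" column appears to be the relative difference between the simulated and measured values, taken against the smaller of the two; this formula is inferred from the published figures rather than stated in the text, so the sketch below is our reconstruction:

```python
def variation(simulated, actual):
    """Relative deviation between simulated and actual values,
    taken against the smaller of the two (formula inferred from
    the published variation figures)."""
    lo, hi = sorted((simulated, actual))
    return hi / lo - 1.0

# Second and third rows of table 6
v2 = variation(1210625.6, 1194509)    # about 0.013492
v3 = variation(1722125.6, 1760153)    # about 0.022082
```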
(The tree shown in figure 12 has one computer at level 1, two computers at level 2, and four computers at level 3.)
FIGURE 13 – Response times for the prototype executions and simulations at a homogeneous environment (x axis: execution types — 10330000, 10330000 + sleep500 * 1024, 10330000 + sleep500 * 2048; y axis: response time in milliseconds, from 0 to 2,000,000; series: Simulation and Actual Execution)
Observing table 6, a small variation between the actual and simulated values
may be noted, ranging from 0.068889% to 2.2081664%. The low variance of the
results validates the use of the simulator for the feasibility analysis of creating
load balancing environments with the TLBA. The graph in figure 13 shows
these results.
Table 7 shows the average number of load balancing messages exchanged in the
actual environment, represented by the prototype, and in the simulation
executions. The variance in the number of messages between the actual and
simulated systems is 5.8823529%, a figure that also confirms the simulator can
be used to forecast the behavior of an environment that intends to use the
TLBA. Table 7 shows one column for the number of messages exchanged
among the levels and another for the total messages exchanged in the
environment. If a single network is used, the number of generated messages is
found in the "Total" column; if physically distinct networks are used, the
average number of messages is found in the "Per level" column.
TABLE 7 – Messages passing through the network for the actual and simulated environments

                                  Per level       Total
Simulator                         283.3333333     850
Prototype (actual environment)    300             900
Variation                         0.058823529     0.058823529
The actual executions and simulations in the heterogeneous environment
followed the same pattern as in the homogeneous environment. The
heterogeneous environment was composed of 5 computers with a 368 cycles/ms
capacity and 2 computers with a 139 cycles/ms capacity.
Table 8 shows the results obtained from the executions of the TLBA prototype
and simulator in this heterogeneous environment. The processes were submitted
in the same three situations previously defined for the homogeneous
environment. In each test, 350 processes were submitted to the environment
using a Poisson probability distribution function with average 3 seconds. It may
be noted that the variation between the simulated and actual response times was
from 1 to 4%. The graph in figure 14 shows these results.
TABLE 8 – Response times at the heterogeneous environment (using the prototype and the
simulator)

Cycles                          Simulator (ms)    Prototype, actual environment (ms)    Variation
10330000                        781,348.8903      750,352.93                            0.04
10330000 + 512500 ms sleep      1,293,848.89      1,279,203.00                          0.01
10330000 + 1024000 ms sleep     1,805,348.89      1,742,498.00                          0.04
FIGURE 14 – Response times for the prototype executions and simulations at a heterogeneous environment (x axis: execution types — 10330000, 10330000 + sleep500 * 1024, 10330000 + sleep500 * 2048; y axis: response time in milliseconds, from 0 to 2,000,000; series: Simulation and Actual Execution)
Table 9 shows the number of messages generated both in the actual environment
(prototype) and in the simulated one. It may be noted that the variation between
the expected and actual number of messages is about 7%.
TABLE 9 – Messages passing through the network for the actual and simulated environments

                                  Per level       Total
Simulator                         257             771
Prototype (actual environment)    275             825
Variation                         0.070038911     0.070038911
As may be observed, the values obtained in the simulations were later confirmed
by the prototype executions, which validates the performed simulations. It
should also be highlighted that all actual executions were conducted with the
aim of overloading the system, so that the comparisons between the prototype
(actual system) and the simulator could be made under extreme situations. In
these situations, the low variation in the number of messages exchanged to
perform the load balancing operations and in the process response times was
confirmed.
VI. Conclusions
The paper proposes and analyzes the TLBA algorithm, which creates a virtual
connecting tree to update the information on the system occupation. This tree is
used to carry out depth searches that find the idlest computer, the one that
should execute a process received by the system at a given instant. Analyzing
the simulations presented in section IV, it was concluded that the TLBA
algorithm decreases the number of messages generated in a given environment
and yields shorter response times than the modified Lowest algorithm [10].
In order to confirm the results obtained in the simulations, a prototype was
implemented on the Linux operating system. This prototype was submitted to
tests whose results confirmed the low variation between the simulated and
actual (prototype) environments in the response times and in the number of
messages exchanged during the load balancing operations.
The low variances between the TLBA prototype and simulator confirmed the
simulator's usefulness for forecasting the behavior of environments that use the
TLBA for load balancing and, consequently, validated the comparisons made
between the TLBA and the modified Lowest algorithms (section IV), in which
the TLBA showed the best results.
References
[1] KRUEGER, P. and LIVNY, M., "The Diverse Objectives of Distributed Scheduling Policies", Proc.
Seventh Int'l Conf. Distributed Computing Systems, IEEE CS Press, Los Alamitos, Calif., Order Nr
801 (microfiche only), 1987, pp. 242-249.
[2] SHIVARATRI, N. G., KRUEGER, P. and SINGHAL, M., “Load Distributing for Locally Distributed
Systems”, IEEE Computer, December 1992, pp. 33-44.
[3] SINHA, P. K., “Distributed Operating Systems: Concepts and Design” , John Wiley & Sons, 1997.
[4] ZHOU, S. and FERRARI, D., "An Experimental Study of Load Balancing Performance", Report Nr
UCB/CSD 87/336, January 1987, PROGRES Report Nr 86.8, Computer Science Division (EECS),
University of California, Berkeley, California 94720.
[5] THEIMER, M. M. and LANTZ, K. A., “Finding Idle Machines in a Workstation-Based Distributed
Systems”, In Proc. IEEE 8th Int. Conf. On Distributed Computing Systems, pp. 112-122, 1988.
[6] MELLO, R. F., PAIVA, M. S., TREVELIN, L. C., GONZAGA, A., “Model for Resources Load
Evaluation in Clusters by Using a Peer-to-Peer Protocol” , The 2002 International Conference on
Parallel and Distributed Processing Techniques and Applications, Las Vegas, USA, 24 - 27 June 2002.
[7] MELLO, R. F., PAIVA, M. S., TREVELIN, L. C., “OpenTella: A Peer-to-Peer protocol for the load
balancing in a system formed by a cluster from clusters” , 4th International Symposium on High
Performance Computing ISHPC-IV, Japan, May 2002.
[8] MELLO, R. F., PAIVA, M. S., TREVELIN, L. C., "Analysis on the Significant Information to Update
the Tables on Occupation of Resources by Using a Peer-to-Peer Protocol", 16th Annual International
Symposium on High Performance Computing Systems and Applications, 17-19 June 2002, Moncton,
New Brunswick, Canada.
[9] MELLO, R. F., PAIVA, M. S., TREVELIN, L. C., “A Java Cluster Management Service”, IEEE
International Symposium on Network Computing and Applications, February 2002, Cambridge,
Massachusetts, USA.
[10] MELLO, R. F., TREVELIN, L. C., PAIVA, M. S., “Comparative Study of the Server-Initiated Lowest
Algorithm using a Load Balancing Index Based on the Process Behavior for Heterogeneous
Environment” , Submitted to the Kluwer The Journal of Networks, Software Tools and Applications,
August 2003.
[11] HUI, C.; CHANSON, S.T. (1999). Hydrodynamic Load Balancing. IEEE Transactions on Parallel
and Distributed Systems, v. 10, n.11, p. 1118-1137, November 1999.
[12] MITZENMACHER, M. (2000). How Useful Is Old Information? IEEE Transactions on Parallel and
Distributed Systems, v. 11, n.1, pp. 6-20, January 2000.
[13] ANDERSON, J.R.; ABRAHAM, S. (2000). Performance-Based Constraints for Multidimensional
Networks. IEEE Transactions on Parallel and Distributed Systems, v. 11, n.1, p. 21-35, January 2000.
[14] AMIR, Y. et al. (2000). An Opportunity Cost Approach for Job Assignment in a Scalable Computing
Cluster. IEEE Transactions on Parallel and Distributed Systems, v. 11, n.7, p. 760-768, July 2000.
[15] KOSTIN, A.E.; AYBAY, I.; OZ, G. (2000). A Randomized Contention-Based Load-Balancing
Protocol for a Distributed Multiserver Queuing System. IEEE Transactions on Parallel and
Distributed Systems, v. 11, n.12, p. 1252-1273, December 2000.
[16] FIGUEIRA, S.M.; BERMAN, F. (2001). A Slowdown Model for Applications Executing on Time-
Shared Clusters of Workstations. IEEE Transactions on Parallel and Distributed Systems, v. 12, n.6,
p. 653-670, June 2001.
[17] ZOMAYA, A.Y.; TEH, Y. (2001). Observations on Using Genetic Algorithms for Dynamic Load-
Balancing. IEEE Transactions on Parallel and Distributed Systems, v. 12, n.9, p. 899-911, September
2001.
[18] ZHANG, Y. (2001). Impact of Workload and System Parameters on Next Generation Cluster
Scheduling Mechanisms. IEEE Transactions on Parallel and Distributed Systems, v. 12, n.9, p. 967-
985, September 2001.
[19] MITZENMACHER, M. (2001). The Power of Two Choices in Randomized Load Balancing. IEEE
Transactions on Parallel and Distributed Systems, v. 12, n.10, p. 1094-1104, October 2001.
[20] WOODSIDE, C.M.; MONFORTON, G.G. (1993). Fast Allocation of Processes in Distributed and
Parallel Systems. IEEE Transactions on Parallel and Distributed Systems, v. 4, n.2, p. 164-174,
February 1993.
[21] PAGE, I; JACOB, T.; CHERN, E. (1993). Fast Algorithms for Distributed Resource Allocation. IEEE
Transactions on Parallel and Distributed Systems, v. 4, n.2, p. 188-197, February 1993.
[22] HSU, T. et al. (2000). Task Allocation on a Network of Processors. IEEE Transactions on Parallel
and Distributed Systems, v.49, n.12, p.1339-1353, December 2000.
[23] GNU/GCC Supported Platforms. URL: http://gcc.gnu.org/install/specific.html. Consulted on June 25, 2003.