The 2nd Workshop on Hardware/Software Support for High Performance Scientific and
Engineering Computing
SHPSEC-03
New Orleans, Louisiana, Sep. 27--Oct. 1, 2003
in conjunction with
The 12th International Conference on Parallel Architectures and Compilation Techniques (PACT-03)
Message from the SHPSEC-03 Workshop Chairs
Welcome to the 2nd International Workshop on Hardware/Software Support
for High Performance Scientific and Engineering Computing (SHPSEC-03),
held in conjunction with the 12th International Conference on Parallel
Architectures and Compilation Techniques (PACT-03) in New Orleans,
Louisiana, USA, Sep. 27-Oct. 1, 2003.
The field of parallel and distributed processing has gained
prominence through advances in electronic and integrated-circuit
technologies beginning in the 1940s. These are exciting times, and the
years to come will witness a proliferation in the use of parallel and
distributed systems, including supercomputers. The scientific and
engineering application domains have a key role in shaping future
research and development activities in academia and industry.
The purpose of this workshop is to provide an open forum for computer
scientists, computational scientists and engineers, applied
mathematicians, and researchers to present, discuss, and exchange
research-related ideas, results, works-in-progress, and experiences
in the areas of architectural, compilation, and language support for
problems in science and engineering applications.
Possible topics include, but are not limited to:
• Parallel and distributed architectures and computation models
• Hardware/software co-design
• Compilers, hardware, tools, programming languages and algorithms,
operating system support and I/O issues
• Techniques in the context of JAVA (Java multi-threading, Java
processors with parallelism, novel compiler optimizations for
Java, etc.)
• Compiler/hardware support for efficient memory systems
• Novel uses for threads to improve performance or power
• Software dynamic translation and modification
• Hardware/software and technological support for energy-aware
applications
• Network, Mobile/wireless processing and computing
• Software processes for development/maintenance/evolution of
scientific and engineering software
• Cluster and grid computing
The application areas are to be understood very broadly and
include, but are not limited to, computational fluid dynamics
and mechanics, material sciences, space, weather, climate
systems and global changes, computational environment and energy
systems, computational ocean and earth sciences, combustion
system simulation, computational chemistry, computational
physics, bioinformatics and computational biology, medical
applications, transportation systems simulations,
combinatorial and global optimization problems, structural
engineering, computational electro-magnetics, data mining,
computer graphics, virtual reality and multimedia,
computational finance, semiconductor technology, electronic
circuit and system design, signal and image processing, and
dynamic systems.
We received a very large number of paper submissions, not only from
the Asia-Pacific region but also from Europe and North America. All
submissions were reviewed by at least three committee members or
external reviewers. It was extremely difficult to select the
presentations for the workshop because there were many excellent and
interesting submissions. In order to accommodate as many papers as
possible while keeping the quality of the workshop high, we finally
decided to accept 25 papers for oral presentation at the workshop.
We believe all of these papers and topics will not only provide novel
ideas, new results, work in progress, and state-of-the-art techniques
in this field, but also stimulate future research activities in
the area of hardware/software support for high performance computing
in science and engineering applications.
The program for this workshop is the result of the hard and excellent
work of many people, including the reviewers and program committee
members. We would like to express our sincere appreciation to all
authors for their valuable contributions, and to all program committee
members and external reviewers for their cooperation in completing the
workshop program under a very tight schedule. Last but not least, we
thank Dr. Martin Schulz and Prof. J. Ramanujam, the workshop chairs of
PACT-03, for their patience and effort in helping and encouraging the
inclusion of SHPSEC-03 in PACT-03.
SHPSEC-03 Program Chairs
Laurence Tianruo Yang, St. Francis Xavier University, Canada
Minyi Guo, University of Aizu, Japan
SHPSEC-03 Program Vice-Chairs
Martin Buecker, Aachen University of Technology, Germany
Jiannong Cao, The Hong Kong Polytechnic University, P. R. China
Weng-Long Chang, Southern Taiwan University of Technology, Taiwan
Vipin Chaudhary, Wayne State University, USA
Beniamino Di Martino, Second University of Naples, Italy
Weijia Jia, City University of Hong Kong, P. R. China
Darren J. Kerbyson, Los Alamos National Lab., USA
Jie Li, University of Tsukuba, Japan
Thomas Rauber, University of Bayreuth, Germany
Gudula Runger, Technical University of Chemnitz, Germany
Technical and Program Committee:
Hamid R. Arabnia, University of Georgia, USA
Ioana Banicescu, Mississippi State University, USA
Subhash Bhalla, University of Aizu, Japan
Anu Bourgeois, Georgia State University, USA
Martin Buecker, Aachen University of Technology, Germany
J. Carlos Cabaleiro, Universidade Santiago de Compostela, Spain
Kirk W. Cameron, University of South Carolina, USA
Jiannong Cao, The Hong Kong Polytechnic University, P. R. China
Kasidit Chanchio, Oak Ridge National Laboratory, USA
Weng-Long Chang, Southern Taiwan University of Technology, Taiwan
Barbara Chapman, University of Houston, USA
Vipin Chaudhary, Wayne State University, USA
Beniamino Di Martino, Second University of Naples, Italy
Ramon Doallo, Universidade da Coruna, Spain
Rudolf Eigenmann, Purdue University, USA
Thomas Fahringer, University of Vienna, Austria
Len Freeman, University of Manchester, UK
Michael Gerndt, Technical University of Munich, Germany
John L. Gustafson, Sun Microsystems, Inc., USA
Dora B. Heras, Universidade Santiago de Compostela, Spain
Pao-Ann Hsiung, National Chung Cheng University, Taiwan, R.O.C.
Constantinos Ierotheou, University of Greenwich, UK
Weijia Jia, City University of Hong Kong, P. R. China
Hai Jin, Huazhong University of Science and Technology, P. R. China
D. J. Kerbyson, Los Alamos National Laboratory, USA
Zhiling Lan, Illinois Institute of Technology, USA
Jie Li, University of Tsukuba, Japan
Sanli Li, Tsinghua University, P. R. China
Zhiyuan Li, Purdue University, USA
Xiaola Lin, University of Hong Kong, P. R. China
Yong Luo, Intel, USA
Russ Miller, State University of New York at Buffalo, USA
Frank Mueller, North Carolina State University, USA
Michael K. Ng, University of Hong Kong, P. R. China
John O'Donnell, University of Glasgow, UK
Stephan Olariu, Old Dominion University, USA
Mohamed Ould-Khaoua, University of Glasgow, UK
Tomas F. Pena, Universidade Santiago de Compostela, Spain
Janusz Pipin, National Research Council of Canada, Canada
Constantine Polychronopoulos, University of Illinois at Urbana-Champaign, USA
Xiangzhen Qiao, Chinese Academy of Sciences, P. R. China
J. Ramanujam, Louisiana State University, USA
Yi Pan, Georgia State University, USA
Thomas Rauber, University of Bayreuth, Germany
Jose D. P. Rolim, University of Geneva, Switzerland
Gudula Runger, Technical University of Chemnitz, Germany
Olaf Schenk, University of Basel, Switzerland
Erich Schikuta, University of Vienna, Austria
Stanislav G. Sedukhin, University of Aizu, Japan
Edwin H.-M. Sha, University of Texas at Dallas, USA
Gurdip Singh, Kansas State University, USA
Peter Strazdins, Australian National University, Australia
Eric de Sturler, University of Illinois at Urbana-Champaign, USA
Sabin Tabirca, University College Cork, Ireland
Luciano Tarricone, University of Perugia, Italy
Xinmin Tian, Intel, USA
Mateo Valero, Universidad Politecnica de Catalunya, Spain
Chengzhong Xu, Wayne State University, USA
Jie Wu, Florida Atlantic University, USA
Zhiwei Xu, National Center for Intelligent Computing Systems, P. R. China
Jingling Xue, The University of New South Wales, Australia
Zahari Zlatev, National Environmental Research Institute, Denmark
Xiaodong Zhang, College of William and Mary, USA
Yu Zhuang, Texas Tech University, USA
Weimin Zheng, Tsinghua University, P. R. China
Bingbing Zhou, Deakin University, Australia
Wanlei Zhou, Deakin University, Australia
Hans Zima, Jet Propulsion Laboratory, California Institute of Technology, USA, and University of Vienna, Austria
Albert Y. Zomaya, University of Sydney, Australia
List of Reviewers: Rocco Aversa, Alessandra Budillon, Fan Chan, Yanqing Ji,
Hai Jiang, Hai Jin, Dora B. Heras, Sascha Hunold,
Guan-Joe Lai, Lidong Lin, Man Lin, John O'Donnell,
Zina Ben Miled, Gennadiy Nikishkov, Chang Woo Pyo,
Xiangzhen Qiao, Arno Rasch, Gianmarco Romano,
Robert R. Roxas, Gianpaolo Saggese, Erich Schikuta,
Stanislav G. Sedukhin, Emil Slusanschi, Oliviero Talamo,
Yuichi Tsujita, Tu Wanqing, Alexander P. Vazhenin,
Andre Vehreschild, Salvatore Venticinque, John Walters,
Bo-Wen Wang, Guojun Wang, Jiahong Wang, Andreas Wolf,
Xingfu Wu, Yusaku Yamamoto, Rentaro Yoshioka, Binbing Zhou,
Jingyang Zhou, Dakai Zhu
ADVANCE PROGRAM
SEPTEMBER 27, 2003
8:00-8:30 am Registration and Breakfast
8:30-10:10 am  Session 1: System Architectures (1)
  Utilization of the On-Chip L2 Cache Area in CC-NUMA Multiprocessors for Applications with a Small Working Set
    Sung Woo Chung, Hyong-Shik Kim and Chu Shik Jhon, Samsung Electronics, Chungnam National University and Seoul National University, Korea
  Traditional versus Next-Generation Journaling File Systems: a Performance Comparison Approach
    Juan Piernas, Toni Cortes and Jose M. Garcia, Universidad de Murcia and Universitat Politecnica de Catalunya, Spain
  Super-Fast Active Memory Operations-Based Synchronization
    Lixin Zhang, Zhen Fang, John B. Carter and Michael A. Parker, University of Utah, USA
  Dynamic Code Repositioning for Java
    Shinji Tanaka, Tetsuyasu Yamada, and Satoshi Shiraishi, Network Service Systems Laboratories, NTT, Japan
10:10-10:30 am  Coffee Break
10:30 am-12:10 pm  Session 2: Numerical Computations
  Designing Polylibraries to Speed Up Linear Algebra Computations
    Pedro Alberti, Pedro Alonso, Antonio Vidal, Javier Cuenca and Domingo Gimenez, Universidad Politecnica de Valencia and Universidad de Murcia, Spain
  A Parallel Implementation of Multi-domain High-order Navier-Stokes Equations Using MPI
    Hui Wang, Yi Pan, Minyi Guo, Anu G. Bourgeois, Da-Ming Wei, University of Aizu, Japan and Georgia State University, USA
  Developing an Object-oriented Parallel Iterative-Methods Library
    Chakib Ouarraoui and David Kaeli, Northeastern University, USA
  Super-Programming Techniques for Large Sparse Matrix Multiplication on PC Clusters
    Dejiang Jin and Sotirios G. Ziavras, New Jersey Institute of Technology, USA
12:10-1:30 pm  Lunch Break
1:30-3:10 pm  Session 3: Scientific and Engineering Applications
  A Scene Partitioning Method for Global Illumination Simulations on Clusters
    M. Amor, J. R. Sanjurjo, E. J. Padron, A. P. Guerra and R. Doallo, Universidade da Coruna, Spain
  Efficient Scheduling for Low-Power High-Performance DSP Applications
    Zili Shao, Qingfeng Zhuge, Youtao Zhang, Edwin H.-M. Sha, University of Texas at Dallas, USA
  Robust Transmission of Wavelet Video Sequence Using Intra Prediction Coding Scheme
    Joo-Kyong Lee, Ki-Dong Chung, Pusan National University, Korea
  A Novel Hierarchical Architecture for Communication Delay Enhancement in Large-Scale Information Systems
    Khaled Ragab, Naohiro Kaji, Khoirul Anwar, and Kinji Mori, Tokyo Institute of Technology, Japan
3:10-3:30 pm  Coffee Break
3:30-5:10 pm  Session 4: Language and Computation Model
  Problems with Using MPI 1.1 and 2.0 as Compilation Targets for Parallel Language Implementations
    Dan Bonachea and Jason Duell, University of California at Berkeley, USA
  Fast Parallel Solution For Set-Packing and Clique Problems by DNA-based Computing
    Michael (Shan-Hui) Ho, Weng-Long Chang, Minyi Guo, and Laurence Tianruo Yang, Southern Taiwan University of Technology, Taiwan, University of Aizu, Japan and St. Francis Xavier University, Canada
  HPM: A Hierarchical Model for Parallel Computations
    Xiangzhen Qiao, Shuqing Chen et al., Chinese Academy of Sciences, P. R. China
  A Visual Environment for Specifying Global Reduction Operations
    Robert R. Roxas and Nikolay N. Mirenkov, University of Aizu, Japan
SEPTEMBER 28, 2003
8:30-10:10 am  Session 5: System Architectures (2)
  VLaTTe: A Java Just-in-Time Compiler for VLIW with Fast Scheduling and Register Allocation
    Suhyun Kim, Soo-Mook Moon, Kemal Ebcioglu and Erik Altman, Seoul National University, Korea and IBM T. J. Watson Research Center, USA
  Speeding-up Multiprocessors Running DSS Workloads through Coherence Protocols
    Pierfrancesco Foglia, Roberto Giorgi, Cosimo Antonio Prete, Università di Pisa, Italy
  Toward Optimization of OpenMP Codes for Synchronization and Data Reuse
    Tien-hsiung Weng and Barbara Chapman, University of Houston, USA
  Hierarchical Interconnection Networks Based on (3, 3)-Graphs for Massively Parallel Processors
    Gene Eu Jan, Yuan-Shin Hwang, Chung-Chih Luo and Deron Liang, National Taiwan Ocean University, Taiwan
10:10-10:30 am  Coffee Break
10:30 am-12:35 pm  Session 6: Distributed Computing
  High System Availability Using Neighbor Replication on Grid
    M. Mat Deris, A. Noraziah, M. Y. Saman, A. Noraida and Y. W. Yuan, University College Terengganu, Malaysia
  MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-enabled MPI Processes
    Namyoon Woo, Heon Y. Yeom and Taesoon Park, Seoul National University and Sejong University, Korea
  Omega Line Problem in a Consistent Global Recovery Line
    MaengSoon Baik, SungJin Choi, JoonMin Gil, ChanYeol Park, HyunChang Yoo and ChongSun Hwang, Korea University and Korea Institute of Science and Technology Information, Korea
  A Lightweight P2P Extension to ICP
    Ping-Jer Yeh, Yu-Chen Chuang, and Shyan-Ming Yuan, National Chiao Tung University and Macronix International Co., Ltd, Taiwan
  Comparative Analysis of the Prototype and the Simulator of a New Load Balancing Algorithm for Heterogeneous Computing Environments
    Rodrigo F. De Mello, Luis C. Trevelin, Maria Stela V. De Paiva et al., University of São Paulo and Federal University of São Carlos, Brazil
Session 1: System Architectures (1)
Utilization of the On-Chip L2 Cache Area in CC-NUMA Multiprocessors for Applications with a Small Working Set

Sung Woo Chung, Hyong-Shik Kim† and Chu Shik Jhon††
[email protected]  [email protected]  [email protected]

Processor Architecture P/T, System LSI Division, Device Solution Network, Samsung Electronics Co., Ltd., Suwon-Si, 442-742, Korea
†Department of Computer Science, Chungnam National University, Taejon, 305-764, Korea
††School of Electrical Engineering and Computer Science, Seoul National University, Seoul, 151-742, Korea
Abstract
In CC-NUMA multiprocessor systems, it is important to reduce the remote memory access time. Based upon the fact that increasing the size of the LRU second-level (L2) cache beyond a certain value does not reduce the cache miss rate significantly, in this paper we propose two split L2 caches to utilize the surplus of the L2 cache. The split L2 caches are composed of a traditional LRU cache and another cache to reduce the remote memory access time. Both work together to reduce the total L2 cache miss time by keeping remote (or long-distance) blocks as well as recently used blocks. For another cache, we propose two alternatives: an L2-RVC (Level 2 - Remote Victim Cache) and an L2-DAVC (Level 2 - Distance-Aware Victim Cache). The proposed split L2 caches reduce total execution time by up to 27%. It is also found that the proposed split L2 caches outperform a traditional single LRU cache of double the size.

Keywords: CC-NUMA, Cache Replacement Policy, On-Chip Cache, Split L2 Cache, Remote Victim Cache, Distance-Aware Cache, Interconnection Network
1. Introduction
Most processors heavily rely on the cache to reduce the average memory access time. There have been many studies on reducing the average cache access time, which depends not only on the cache miss rate but also on the cache miss penalty. Increasing the cache block size and prefetching cache blocks were proposed to reduce the cache miss rate, while the non-blocking cache, the critical-word-first cache, and the cache hierarchy were proposed to alleviate the cache miss penalty. By inserting a second-level (L2) cache between the first-level (L1) cache and memory, which is called a cache hierarchy, we can capture many accesses that would otherwise go out to memory. Another advantage of a cache hierarchy is that it allows the small first-level (L1) cache to keep up with the clock frequency of the processor.
A current trend in the design of the L2 cache is an increase in associativity and capacity. As the associativity of the L2 cache increases to 4 or even 8 ways, the replacement of blocks in the cache becomes an important issue for overall performance. The LRU replacement algorithm has been known to work moderately well with caches. With the LRU replacement policy, however, an increase in cache size beyond a certain value does not reduce the cache miss rate much. As shown in [1], with the LRU cache replacement policy in 32-processor systems, increasing the cache size over 256 KB does not reduce the cache miss rate much in most SPLASH-2 applications. The authors observe knees around a cache size of 256 KB in most application programs. This means that the memory access patterns do not benefit from caches larger than 256 KB in most of the SPLASH-2 applications.
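This knee behavior is easy to reproduce with a toy model. The sketch below is our own illustration (not the authors' simulator): a set-associative LRU cache replayed over a 256 KB working set stops improving once its capacity covers that working set.

```python
from collections import OrderedDict

class LRUCache:
    """Toy set-associative cache with LRU replacement (illustrative only)."""
    def __init__(self, num_sets, ways, line_size=32):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // self.line_size
        s = self.sets[tag % self.num_sets]
        if tag in s:                     # hit: move block to the MRU position
            s.move_to_end(tag)
            self.hits += 1
        else:                            # miss: evict the LRU block if the set is full
            if len(s) >= self.ways:
                s.popitem(last=False)
            s[tag] = True
            self.misses += 1

def miss_rate(cache_kb, passes=4):
    """Replay a 256 KB working set over an 8-way cache of the given size."""
    c = LRUCache(num_sets=cache_kb * 1024 // (32 * 8), ways=8)
    for _ in range(passes):
        for a in range(0, 256 * 1024, 32):   # 256 KB working set, 32 B lines
            c.access(a)
    return c.misses / (c.hits + c.misses)

for kb in (128, 256, 512):
    print(kb, "KB -> miss rate", miss_rate(kb))
```

With a cyclic 256 KB working set, the 128 KB cache thrashes (miss rate 1.0), while 256 KB and 512 KB both settle at the cold-miss floor of 0.25 over four passes: past the knee, extra capacity buys nothing.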
The L1 cache size is generally limited so that it can keep up with the processor clock. On the other hand, the L2 cache is large enough to reduce the L1 cache miss penalty. How can we utilize the increasing capacity of the L2 cache if 256 KB is sufficient for the applications? In fact, the sufficient capacity could be 512 KB, 1 MB, or more in other applications. However, there would be limits to performance enhancement with an LRU cache (we use the term "LRU cache" to refer exclusively to the LRU cache at the second level, for clarity, even though the L1 also runs an LRU replacement policy) beyond a certain capacity. If increasing the L2 cache size does not increase the cache hit rate significantly in the processors of a multiprocessor system, how do we make use of the surplus of the L2 cache?
In a multiprocessor system, a higher cache hit rate does not necessarily mean better performance. Completing a local memory block miss does not depend on the transmission latency of the interconnection network, while completing a remote memory block miss does. Thus the average cache miss time can be reduced by a remote cache [7] or an RVC [5, 6] even if it does not reduce the cache miss rate.
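To see why, a back-of-the-envelope model helps. The latencies and rates below are illustrative round numbers of our own, not the paper's simulation parameters: a cache that converts remote misses into local misses can win even with a slightly lower hit rate.

```python
def avg_memory_time(hit_rate, remote_frac, hit=6, local_miss=50, remote_miss=200):
    """Average access time (in cycles) for a two-outcome miss model.
    All latencies here are illustrative, not the paper's parameters."""
    miss_time = remote_frac * remote_miss + (1 - remote_frac) * local_miss
    return hit_rate * hit + (1 - hit_rate) * miss_time

# A: higher hit rate, but most misses go to remote memory.
# B: slightly lower hit rate, but a victim cache catches most remote misses.
a = avg_memory_time(hit_rate=0.95, remote_frac=0.8)
b = avg_memory_time(hit_rate=0.94, remote_frac=0.2)
print(f"A: {a:.2f} cycles, B: {b:.2f} cycles")
```

Configuration A averages 14.2 cycles, B only 10.44: the 1% hit-rate loss is more than paid back by cheaper misses, which is exactly the tradeoff the split L2 cache exploits.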
There have been several studies on reducing the remote memory access time and the traffic on the interconnect by inserting an additional cache. The remote cache, which stores only remote data, was proposed in [7]. R. Iyer and L. N. Bhuyan introduced the Switch Cache [8]. The main idea was to implement small, fast caches in the crossbar switches of the interconnect to capture and store shared data as they flow from the memory module to the requesting processor. A. Moga and M. Dubois proposed an SRAM network cache outside the processor [5], which is organized as a victim cache for remote data. The major appeal of SRAM network caches is that they add less penalty to the latency of RVC hits. The works mentioned above targeted reducing the average cache miss time by reducing the remote memory access time. The RVC itself was already proposed in [5, 6]. Unlike the organization of the RVC in [5, 6], the L2-RVC in this study is located at the second level and is not shared by multiple processors.
There also have been studies on split caches. A split L1 data cache for uniprocessor systems was proposed by A. Gonzalez et al. [9]. Their split cache consists of a cache for spatial locality and a cache for temporal locality. The former is a traditional cache based on the LRU replacement policy. The latter behaves as a victim cache, which stores only the recently used block of the total cache line and also runs the LRU replacement policy. A split L1 data cache for multiprocessor systems was also evaluated [10]. Those split caches are based on the LRU replacement policy, resulting in a reduction of the cache miss rate.
In this paper, we propose a split L2 cache for multiprocessor systems, composed of an LRU cache and another cache. To reduce the remote memory access time, we suggest the L2-RVC (Level 2 - Remote Victim Cache) and the L2-DAVC (Level 2 - Distance-Aware Victim Cache) for another cache. Another cache keeps remote (or long-distance) victim blocks, even if they are not needed until far in the future. The L2-RVC or the L2-DAVC operates just like the victim cache proposed by Jouppi [7], since it keeps blocks that were discarded by the LRU cache. However, it differs from the original victim cache in that it is as large as its victim provider (the LRU cache) and that it is accessed simultaneously with its victim provider.
In the next section, we describe the structure of the proposed L2 cache in detail. Section 3 explains the simulation methodology with the application programs used in our simulations and examines the simulation results. Section 4 concludes the paper.
2. Split L2 cache structure
Figure 1 shows a block diagram of the memory hierarchy with the cache hierarchy emphasized. A split L2 cache is composed of an LRU cache and another cache, which behaves just like a victim cache of the LRU cache. The LRU cache, which runs the LRU replacement policy, still favors recently used blocks, while another cache keeps remote (or long-distance) blocks. Another cache is yet another set-associative cache accessed simultaneously with the LRU cache. In order to exploit temporal locality, the LRU cache is still needed at the second level of the cache hierarchy. Since a block evicted from the L1 cache is sent to the L2 cache of the same node, we only have to consider the cache miss rate when selecting a victim in the L1 cache. Only when a block is replaced in the L2 cache do we have to consider the cache miss time as well as the cache miss rate.
Figure 1. Block diagram of the memory hierarchy with the split L2 cache
In a traditional two-level cache hierarchy, all cache blocks of the L1 cache are included in the L2 cache. This inclusion property makes it easy to maintain the memory hierarchy; however, it wastes cache capacity by duplicating data. We do not maintain the inclusion property between the L1 cache and the L2 cache, as in [3, 12, 13, 14]. Instead, we keep a duplicate copy of the tags and the states of the L1 cache in the L2 cache controllers. Since the advantages and disadvantages of the inclusion property are already described in [3, 12], and the inclusion property was not maintained in many off-the-shelf processors such as the Intel Xeon [14] and the Alpha 21264 [13], we do not discuss the efficiency of the inclusion property any further in this paper.
We propose the L2-RVC and the L2-DAVC for another cache. The L2-RVC contains only remote blocks, and it uses the LRU replacement policy to find a victim among the remote blocks. Accordingly, a remote block in the L2-RVC can be evicted not by a local block, but only by another remote block. Thus, in the total L2 cache, the number of remote blocks remains substantial compared to that of local blocks. Hence, the proposed L2 cache is expected to reduce the total cache miss time by reducing the cache miss rate of the remote blocks.
The other choice for another cache is the L2-DAVC. The L2-DAVC does not use the LRU replacement policy. Instead, it runs a shortest-distance replacement policy, in which the block whose home node is closest is replaced first. If more than one block in a set has the same distance, the victim is chosen randomly among them. A problem of the L2-DAVC is that a block whose home node is farthest from the current node may reside in the L2-DAVC for quite a long time. Nevertheless, as the L2-DAVC is combined with the LRU cache, there is still a high probability that recently used blocks are kept. This means that recently used blocks are not sacrificed for the L2-DAVC.
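A minimal sketch of the shortest-distance replacement policy, assuming a ring interconnect like the simulated system's (all helper names here are ours, not the paper's):

```python
import random

def ring_distance(node_a, node_b, num_nodes):
    """Hop count between two nodes on a bidirectional ring."""
    d = abs(node_a - node_b) % num_nodes
    return min(d, num_nodes - d)

def pick_davc_victim(cache_set, current_node, num_nodes, rng=random):
    """Shortest-distance replacement: evict a block whose home node is
    closest to the current node; ties are broken randomly."""
    dist = lambda blk: ring_distance(blk["home"], current_node, num_nodes)
    shortest = min(dist(b) for b in cache_set)
    candidates = [b for b in cache_set if dist(b) == shortest]
    return rng.choice(candidates)

# Example: on node 0 of a 16-node ring, the block homed at node 1 is evicted
# first, while the block homed at node 8 (farthest away) is retained.
blocks = [{"tag": 0xA, "home": 8}, {"tag": 0xB, "home": 1}, {"tag": 0xC, "home": 5}]
victim = pick_davc_victim(blocks, current_node=0, num_nodes=16)
print("evict block homed at node", victim["home"])  # -> 1
```

The policy keeps distant blocks precisely because re-fetching them is expensive, which is also the source of the drawback noted above: a far-away block can linger indefinitely.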
Figure 2. Operations of the LRU/another cache
Upon a cache miss in the L1 cache, the cache controller accesses the LRU cache and another cache at the same time. As far as the cache miss is concerned, we have the four cases shown in Figure 2, excluding an L2 coherence miss. When a block is not found in the L1 cache but is in the LRU cache (case a), a victim of the L1 cache is moved to the LRU cache, which may produce another victim there, and the found block is used to fill the empty L1 cache block. In the case that another cache is the L2-RVC, the possible new victim, if it is a remote block, is placed into the L2-RVC. If it is a local block, the possible new victim is evicted to the local memory, because the L2-RVC stores only remote blocks. In the case that another cache is the L2-DAVC, the possible new victim is compared with the cache blocks in the appropriate set of the L2-DAVC, and the block of the shortest distance is replaced, which lets the L2-DAVC store local blocks as well. When the L1-missed block is found in another cache (case b), the victim of the L1 cache is placed into the LRU cache. The subsequent replacements are the same as in case (a). If the block is found in neither the LRU cache nor another cache (case c), the cache controller sends a memory access request to either the local memory or the interconnect. Because of the non-inclusion property, an L1 cache miss that cannot be resolved either in the LRU cache or in another cache is filled directly from the local memory or the interconnect, without allocating a block in the LRU cache or another cache. The subsequent replacements are the same as in case (a).
In the case of an L1 cache coherence miss (case d), an invalidation is issued. We omit the operation of an L2 cache coherence miss, which is a combination of an L1 cache coherence miss and an L2 cache hit.
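Cases (a) through (c) can be traced with a small model. The ToyCache class and all names below are our own simplification of the controller logic (with an L2-RVC as another cache), not the authors' simulator:

```python
class ToyCache:
    """Minimal FIFO-ordered cache used only to trace the case logic."""
    def __init__(self, capacity):
        self.capacity, self.blocks = capacity, []

    def __contains__(self, addr):
        return addr in self.blocks

    def remove(self, addr):
        self.blocks.remove(addr)

    def insert(self, addr):
        """Insert addr; return the evicted block if the cache was full, else None."""
        victim = self.blocks.pop(0) if len(self.blocks) >= self.capacity else None
        self.blocks.append(addr)
        return victim

def l1_miss(addr, l1, lru, rvc, is_remote):
    """Cases (a)-(c) of Figure 2 for an LRU + L2-RVC split cache (sketch)."""
    if addr in lru:        # case (a): found in the LRU half of the L2
        lru.remove(addr)
    elif addr in rvc:      # case (b): found in the remote-victim half
        rvc.remove(addr)
    # case (c): found in neither half; the block would be fetched from local
    # memory or the interconnect, with no allocation in either L2 half.
    l1_victim = l1.insert(addr)            # the found/fetched block fills the L1
    if l1_victim is not None:              # the L1 victim drops into the LRU half
        l2_victim = lru.insert(l1_victim)
        if l2_victim is not None and is_remote(l2_victim):
            rvc.insert(l2_victim)          # only remote L2 victims enter the L2-RVC

l1, lru, rvc = ToyCache(1), ToyCache(2), ToyCache(2)
remote = lambda a: a >= 100                # pretend addresses >= 100 are remote
for a in [100, 101, 1, 102]:               # a stream of L1 misses
    l1_miss(a, l1, lru, rvc, remote)
print("LRU half:", lru.blocks, "L2-RVC:", rvc.blocks)
```

Running the stream shows the key behavior: a remote block (address 100) pushed out of the LRU half is caught by the L2-RVC instead of being written back to its remote home, while local victims are simply dropped.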
3. Simulations
3.1 Simulation environments
Increasing the cache size over 256 KB does not increase the cache hit rate much in most of the SPLASH-2 applications, as shown in [1]. The authors observed knees around a cache size of 256 KB in most application programs. We admit that 256 KB may not be sufficient in other applications; 256 KB is just a sample value for the SPLASH-2 applications. In this section we show the impact of the split L2 cache when 256 KB is sufficient for the LRU portion of the L2 cache.
We use an execution-driven simulator to evaluate the performance of the split L2 cache. The simulator is modified from MINT [4]. MINT executes each instruction in one simulation cycle and sends the events generated by memory operations to the architecture simulator, which then performs the operations necessary to simulate the system architecture.
Table 1. System parameters

Parameter                                           Value
Number of processors                                16, 32
Cache line size                                     32 B
Size of L1 cache                                    64 KB
L1 cache access time                                1 processor clock
L2 cache access time                                6 processor clocks
Memory access time                                  50 processor clocks
Directory (SRAM) access time                        20 processor clocks
Transfer delay between adjacent nodes of the ring   64 processor clocks
The system parameters are shown in Table 1. The simulated systems are CC-NUMA multiprocessors with 16 or 32 nodes. Each node has a processor with its associated caches and a memory. When the L2 cache is composed only of the LRU cache, it is 8-way set associative. Recall that there is no inclusion between the LRU cache and another cache. Accordingly, to make a fair comparison, we assume that the LRU cache and another cache are both 4-way set associative in the split L2 cache. We assume the point-to-point link delay of NUMAlink, which is used in the SGI Origin 3000 [11]. Thus, the transfer delay between adjacent nodes on the ring is 64 processor clocks, obtained as 60 processor clocks (= 20 ns = 32 B / (1.6 GB/s)) + 4 processor clocks (store-and-forward delay), with 1.6 GB/s unidirectional point-to-point links, a 32 B cache line size, and a 3 GHz processor clock. We maintain a 2-bit state field per cache block, corresponding to the MSI protocol, which is modified from a typical MESI protocol. The simulated system interconnects nodes with dual-ring-style point-to-point links, using a full-map directory protocol. We took eight applications, BARNES, CHOLESKY, FFT, FMM, LU, OCEAN, RADIX and WATER-NSQUARED, from the SPLASH-2 benchmark suite [1]. The characteristics of the selected applications are summarized in Table 2.
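The 64-clock link delay follows directly from the stated parameters; the short script below (a sanity check of ours, not part of the simulator) reproduces the arithmetic:

```python
# Reproduce the paper's link-delay arithmetic for the ring interconnect.
line_size_bytes = 32
link_bandwidth = 1.6e9          # 1.6 GB/s unidirectional point-to-point link
clock_hz = 3e9                  # 3 GHz processor clock
store_and_forward_clocks = 4

transfer_s = line_size_bytes / link_bandwidth        # 20 ns to push one line
transfer_clocks = transfer_s * clock_hz              # 60 processor clocks
total = transfer_clocks + store_and_forward_clocks   # 64 processor clocks
print(f"{transfer_s * 1e9:.0f} ns -> {transfer_clocks:.0f} + "
      f"{store_and_forward_clocks} = {total:.0f} clocks")
```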
3.2 Simulation results
First, we show the simulation results when another cache is the L2-RVC. We use LRU + X to mean that the L2 cache is composed of an LRU cache and that another cache is X. We also compare the performance of the LRU + L2-RVC to that of the LRU + L2-DAVC.
To avoid possible confusion, we first introduce some definitions. The pure transmission time on the interconnection network, excluding the delay due to any shortage of resources, is called the "transmission time on the interconnection network". The waiting time for resources on the interconnection network is called the "delay on the interconnection network". A notation X / Y is used, where X and Y are the size of the LRU cache and that of the L2-RVC or the L2-DAVC, respectively.
Application       Type                                                    Problem Size
BARNES            Barnes-Hut hierarchical N-body method                   16K particles
CHOLESKY          Blocked sparse Cholesky factorization                   Tk15.O
FFT               FFT computation                                         65536 complex doubles
FMM               Hierarchical N-body method (adaptive fast multipole)    16K particles
LU                Blocked dense linear algebra                            512x512 matrix, block 16
OCEAN             Study of ocean movements                                258x258 ocean grid
RADIX             Radix sort                                              2M integer keys, radix 1K
WATER-NSQUARED    Water simulation using O(n^2) algorithm                 512 molecules

Table 2. The characteristics of the applications
We assume that the execution time is composed of CPU time, L2 cache access time, directory access time, memory access time, transmission time on the interconnection network, and delay on the interconnection network.
There are two important factors we must observe to evaluate the impact of the split L2 cache on the execution time. One is the L1 cache hit rate, shown in Table 3, and the other is the hit count at another cache. A low L1 cache hit rate makes the L2 cache handle many cache operations. Many hits at another cache mean that remote misses (with the L2-RVC) or long-distance misses (with the L2-DAVC) are effectively removed (or replaced by local misses). If we have many hits at another cache together with a low L1 cache hit rate, the impact should be significant.
Application       P=16     P=32
BARNES            0.993    0.993
CHOLESKY          0.965    0.958
FFT               0.971    0.973
FMM               0.993    0.993
LU                0.985    0.992
OCEAN             0.948    0.965
RADIX             0.987    0.986
WATER-NSQUARED    0.999    0.998

Table 3. L1 cache hit rate versus the number of processors in the system
3.2.1 LRU + L2-RVC
In the first simulations, we examine the impact of the size of another cache by varying the size of the L2-RVC. Figure 3 and Figure 4 show the L2 cache hit rate and the normalized execution time for the 16-processor system, respectively. (Figures 3-6 are shown at the end of this paper; we will resize the figures and place them at appropriate positions if accepted.) Note that the hits at the LRU cache can be assumed to be the same as the hits of a 256 KB traditional L2 cache.
BARNES, FMM, LU, and WATER-NSQUARED have very high L1 cache hit rates, as shown in Table 3, and relatively high L2 cache hit rates, as shown in Figure 3. There is little difference among the cache hit rates of the 256 KB, 512 KB, and 1 MB traditional L2 caches. Thus, adopting the split L2 cache in these applications brings performance enhancements all the time, since there is little gain in the L2 hit rate from increasing the size of the LRU cache. Even though the split L2 cache hit rate is low, it is capable of reducing the execution time much more than the traditional L2 cache, since remote block misses cost more than local block misses do. As shown in Figure 4, any combination of the split L2 caches outperforms the 512 KB traditional L2 cache, and even outperforms the 1 MB traditional cache, with the above applications. Except for FMM, it seems that 64 KB is sufficient for the L2-RVC.
RADIX also has a high L1 cache hit rate. As shown in Figure 3, however, the difference between the hit rate of the 256 KB LRU cache and that of the 512 KB LRU cache is substantial. While the L2 cache hit rates of the 256K/64K and 256K/128K configurations (a notation X/Y is used, where X and Y are the sizes of the LRU cache and the L2-RVC, respectively) are remarkably low, their execution times are not so different from that of the 512 KB traditional L2 cache. We believe that hits at the L2-RVC reduce the execution time more than hits at the LRU cache do.
It is riskier to use the L2-RVC with applications such as FFT, OCEAN, and CHOLESKY, in which there is a large difference among the cache hit rates of the 256 KB, 512 KB, and 1 MB traditional L2 caches. However, if the gap between the cache hit rates is compensated by the L2-RVC, the performance enhancement can be significant. Though there are few hits at the L2-RVC in CHOLESKY, the degradation of the execution time is negligible since local memory accesses are dominant.
In FFT, as the size of the L2-RVC increases, the hit rate of the LRU cache also increases linearly. The improvement in execution time is significant because the L2-RVC outperforms the LRU cache in L2 cache hit rate. In FFT, increasing the L2-RVC significantly affects the execution time because of both a low L1 cache hit rate and a low L2 cache hit rate.
The result of OCEAN is similar to that of FFT. Due to the large working set, a large L2-RVC is needed to compensate for the low cache hit rate of the LRU cache. This is also the reason why OCEAN is the only application in which our split L2 cache does not outperform the 1 MB traditional LRU cache. The results are still encouraging, nonetheless, in that the 256K/256K split L2 cache works better than the same-size 512 KB traditional L2 cache in both cache hit rate and execution time.
Figure 5 and Figure 6 depict similar results when the number of processors is increased to 32. The performance gains in Figure 6 are greater than those in Figure 4. In the case of OCEAN, the 256K/512K split L2 cache finally outperforms the 1 MB traditional L2 cache in execution time. As the number of processors increases, the maximum and the average cache miss penalties tend to increase. It is therefore possible to reduce the execution time even more with 32 processors, since the split L2 caches effectively remove misses that would otherwise incur larger penalties.
LRU cache size              512 KB            1 MB
Number of processors        16       32       16       32
BARNES                      320 KB   320 KB   320 KB   320 KB
CHOLESKY                    384 KB   512 KB   384 KB   512 KB
FFT                         320 KB   320 KB   320 KB   320 KB
FMM                         320 KB   320 KB   320 KB   320 KB
LU                          320 KB   320 KB   320 KB   320 KB
OCEAN                       320 KB   512 KB   1 MB     768 KB
RADIX                       320 KB   320 KB   512 KB   320 KB
WATER-NSQUARED              320 KB   320 KB   320 KB   320 KB
Average (required capacity
to perform no worse than
the LRU cache, %)           328 KB   368 KB   440 KB   400 KB
                            (64.1%)  (71.9%)  (43.0%)  (39.1%)

Table 4. Required capacity of the LRU + L2-RVC to outperform the traditional LRU cache
We investigate the efficiency of the proposed LRU + L2-RVC in Table 4, which presents the required capacity of the LRU + L2-RVC to outperform the traditional LRU cache. The LRU cache of the LRU + L2-RVC is fixed to 256 KB in Table 4. We change the size of the L2-RVC not continuously but discretely (64 KB, 128 KB, 256 KB, and 512 KB).
To surpass the 512 KB LRU cache, the required capacity of the LRU + L2-RVC is, on average, just 328 KB (64.1%) and 368 KB (71.9%) with 16 and 32 processors, respectively. When we compare the LRU + L2-RVC to the 1 MB LRU cache, its efficiency is even more prominent: on average, only 440 KB (43.0%) and 400 KB (39.1%) are required to outperform the 1 MB LRU cache, with 16 and 32 processors, respectively.
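The averages follow directly from the per-application capacities in Table 4 and can be checked with simple arithmetic (the lists below just transcribe Table 4, in its row order):

```python
# Per-application capacity (KB) of the LRU + L2-RVC needed to match
# the traditional LRU cache, transcribed from Table 4.
required_512_p16 = [320, 384, 320, 320, 320, 320, 320, 320]
required_512_p32 = [320, 512, 320, 320, 320, 512, 320, 320]
required_1m_p16  = [320, 384, 320, 320, 320, 1024, 512, 320]
required_1m_p32  = [320, 512, 320, 320, 320, 768, 320, 320]

def avg_kb(caps):
    """Average required capacity in KB."""
    return sum(caps) / len(caps)

def pct(caps, baseline_kb):
    """Average capacity as a percentage of the baseline LRU cache."""
    return 100.0 * avg_kb(caps) / baseline_kb

# e.g. pct(required_512_p16, 512) -> 64.1 %
#      pct(required_1m_p32, 1024) -> 39.1 %
```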
3.2.2 LRU + L2-DAVC

Table 5 shows the L2 cache hit rates of the 512 KB LRU cache, the 1 MB LRU cache, the LRU + L2-RVC (256 KB/256 KB), and the LRU + L2-DAVC (256 KB/256 KB). The total size of the split L2 cache is 512 KB; in the LRU + X configurations, the LRU cache and the X cache each take half of the total L2 cache size. Table 6 shows the normalized execution times of the same four configurations. All results in Table 6 are normalized to the 512 KB LRU cache.
L2 cache organization   LRU (512 KB)   LRU (1 MB)    LRU + L2-RVC   LRU + L2-DAVC
Number of processors    16     32      16     32     16     32      16     32
BARNES                  0.90   0.87    0.92   0.89   0.90   0.89    0.90   0.88
CHOLESKY                0.21   0.22    0.26   0.25   0.16   0.19    0.18   0.20
FFT                     0.18   0.27    0.19   0.29   0.22   0.28    0.23   0.32
FMM                     0.59   0.56    0.64   0.61   0.61   0.58    0.59   0.57
LU                      0.90   0.76    0.91   0.77   0.91   0.77    0.90   0.77
OCEAN                   0.74   0.79    0.88   0.86   0.72   0.81    0.69   0.80
RADIX                   0.55   0.56    0.58   0.59   0.51   0.58    0.52   0.59
WATER-NSQUARED          0.64   0.54    0.65   0.55   0.65   0.54    0.66   0.55

Table 5. L2 cache hit rate
In some applications, such as CHOLESKY, FFT, RADIX, and WATER-NSQUARED, the L2 cache hit rates of the LRU + L2-DAVC are higher than those of the LRU + L2-RVC with 16 processors. However, this does not lead to a significant reduction in execution time: the distances between the nodes are so short that the differences in execution time between the L2-RVC and the L2-DAVC are small. Likewise, when the L2 cache hit rate of the LRU + L2-DAVC is lower than that of the LRU + L2-RVC with 16 processors, the differences in execution time are small in all applications except OCEAN. In OCEAN, the difference between the L2 hit rates of the LRU + L2-DAVC and the LRU + L2-RVC is large, resulting in a large difference in execution time.
L2 cache organization   LRU (512 KB)   LRU (1 MB)    LRU + L2-RVC   LRU + L2-DAVC
Number of processors    16     32      16     32     16     32      16     32
BARNES                  1.00   1.00    0.99   0.98   0.96   0.94    0.97   0.94
CHOLESKY                1.00   1.00    0.99   1.00   0.95   0.97    0.98   1.03
FFT                     1.00   1.00    0.99   0.94   0.74   0.78    0.74   0.73
FMM                     1.00   1.00    0.99   0.97   0.92   0.87    0.93   0.88
LU                      1.00   1.00    0.98   0.99   0.89   0.84    0.91   0.84
OCEAN                   1.00   1.00    0.55   0.69   0.81   0.82    0.95   0.80
RADIX                   1.00   1.00    0.96   0.95   0.86   0.76    0.86   0.75
WATER-NSQUARED          1.00   1.00    1.00   1.00   0.98   0.94    0.98   0.94

Table 6. Normalized execution time
In OCEAN, the L2 cache hit rate of the LRU + L2-DAVC is lower than that of the LRU + L2-RVC in the 32-processor system. However, the execution time of the LRU + L2-DAVC is shorter than that of the LRU + L2-RVC, though the difference is not large. This means that the total hit count at the L2-DAVC is lower than that at the L2-RVC, but the total miss time at the L2-DAVC is also lower.
Strangely, though the L2 cache hit rate of the LRU + L2-DAVC is higher than that of the LRU + L2-RVC in CHOLESKY, the execution time of the LRU + L2-DAVC is also longer. Recall that the L2-DAVC can contain local blocks, while the L2-RVC only contains remote blocks. In CHOLESKY, accesses to local blocks are frequent. Thus, in CHOLESKY, the hits at the L2-DAVC do not reduce the remote memory access time but the local memory access time, which causes this unexpected result.
Increasing the number of processors increases the distances between the nodes. In the simulations, the L2-DAVC leads to only a small reduction in execution time. With more processors, however, the LRU + L2-DAVC is expected to outperform the LRU + L2-RVC, since the LRU + L2-DAVC retains the blocks whose distance between the home node and the current node is long.
4. Conclusions and future work

For CC-NUMA multiprocessor systems, we have proposed two alternative split L2 caches: the LRU + L2-RVC and the LRU + L2-DAVC. The proposed split L2 caches improve performance by keeping remote (or long-distance) blocks as well as recently used blocks.
The execution time with the split L2 cache decreases by up to 27% compared to the traditional non-split L2 cache of the same size. Moreover, we found that the 512 KB split L2 cache could outperform the double-size (1 MB) traditional L2 cache in most applications. It is encouraging that the LRU + L2-DAVC performs better than the LRU + L2-RVC as the number of processors increases; we expect the LRU + L2-DAVC to perform much better than the LRU + L2-RVC with larger numbers of processors (64 or 128).
In summary, we conclude that the surplus L2 cache area can be effectively utilized by adopting our split cache organizations for CC-NUMA multiprocessors. The reason we compare the proposed split caches with the traditional LRU cache is to make sure that the surplus L2 cache area can be used to enhance performance without any additional off-chip cache.
We acknowledge that L2 cache sizes such as 256 KB, 512 KB, or 1 MB are not definitive, since we restrict the simulations to the SPLASH-2 applications. We also admit that the proposed split caches may degrade system performance when the cache size is too small for the application. However, when increasing the LRU cache does not improve total system performance, the proposed split caches can be relied on as an alternative solution that provides multiprocessor systems with good scalability.
Acknowledgements

We would like to thank Craig B. Stunkel at the IBM T. J. Watson Research Center for his helpful comments.
References

[1] S. C. Woo, et al., "The SPLASH-2 Programs: Characterization and Methodological Considerations," International Symposium on Computer Architecture, pp. 24-36, June 1995.
[2] N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," International Symposium on Computer Architecture, 1990.
[3] L. A. Barroso, et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," International Symposium on Computer Architecture, 2000.
[4] J. E. Veenstra and R. J. Fowler, "MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors," International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp. 201-207, 1994.
[5] M. Moga and M. Dubois, "The Effectiveness of SRAM Network Caches in Clustered DSMs," International Symposium on High Performance Computer Architecture, 1998.
[6] H. Oi and N. Ranganathan, "Utilization of Cache Area in On-Chip Multiprocessor," International Symposium on High Performance Computing, 1999.
[7] D. Lenoski, et al., "The DASH Prototype: Implementation and Performance," International Symposium on Computer Architecture, pp. 92-103, 1992.
[8] R. Iyer and L. N. Bhuyan, "Switch Cache: A Framework for Improving the Remote Memory Access Latency of CC-NUMA Multiprocessors," International Symposium on High Performance Computer Architecture, 1999.
[9] A. Gonzalez, C. Aliaga, and M. Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality," ACM International Conference on Supercomputing, pp. 338-347, 1995.
[10] J. Sahuquillo and A. Pont, "Two Management Approaches for the Split Data Cache in Multiprocessor Systems," 8th Euromicro Workshop on Parallel and Distributed Processing, pp. 301-308, 2000.
[11] Silicon Graphics, "SGI Origin 3000 Series: Bricks," available at http://www.sgi.com/origin/3000/bricks.html
[12] N. P. Jouppi and S. J. E. Wilton, "Tradeoffs in Two-Level On-Chip Caching," ACM SIGARCH Computer Architecture News, vol. 22(2), pp. 34-45, 1994.
[13] Compaq Computer Corporation, "21264/EV68CB and 21264/EV68DC Hardware Reference Manual," 2001.
[14] Intel Corporation, "IA-32 Intel Architecture Software Developer's Manual," 2003.
Figure 3. L2 cache hit rate (16 processors)
Figure 3. L2 cache hit rate (16 processors) - continued
Figure 4. Normalized execution time (16 processors)
Figure 4. Normalized execution time (16 processors) -continued
Figure 5. L2 cache hit rate (32 processors)
Figure 5. L2 cache hit rate (32 processors) - continued
Figure 6. Normalized execution time (32 processors)
Figure 6. Normalized execution time (32 processors) - continued
Traditional versus Next-Generation Journaling File Systems: a Performance Comparison Approach

Juan Piernas – Universidad de Murcia
Toni Cortes – Universitat Politècnica de Catalunya
José M. García – Universidad de Murcia
Abstract
DualFS is a next-generation journaling file system which has the same consistency guarantees as traditional journaling file systems but better performance. This paper summarizes the main features of DualFS, introduces three new enhancements which significantly improve DualFS performance during normal operation, and presents experimental results which compare DualFS with other traditional file systems, namely Ext2, Ext3, XFS, JFS, and ReiserFS. The experiments carried out cover a wide range of cases, from microbenchmarks to macrobenchmarks, and from desktop to server workloads. As we will see, these experiments not only show that DualFS has the lowest I/O time in general, but also that it can remarkably boost the overall system performance for many workloads (by up to 98%).
1 Introduction
High-performance computing environments rely on several hardware and software elements to do their jobs. Although some environments require specialized elements, many of them use off-the-shelf elements which must perform well in order to accomplish computing jobs as quickly as possible.
A key element in these environments is the file system. File systems provide a friendly means to store final and partial results, from which it is even possible to resume a job that failed because of a system crash, program error, or any other failure. Some computing jobs require a fast file system, others a quickly-recoverable file system, and others require both features.
DualFS [16] is a new concept of journaling file system which is aimed at providing both good performance and fast consistency recovery. DualFS, like other file systems [5, 7, 8, 18, 24, 28], uses a meta-data log approach for fast consistency recovery, but from a quite different point of view. Our new file system separates data blocks and meta-data blocks completely, and manages them in very different ways. Data blocks are placed on the data device (a disk partition), whereas meta-data blocks are placed on the meta-data device (another partition on the same disk). In order to guarantee fast consistency recovery after a system crash, the meta-data device is treated as a log-structured file system [18], whereas the data device is divided into groups in order to organize data blocks (as other file systems do [4, 8, 13, 24, 27]).

(This work has been supported by the Spanish Ministry of Science and Technology and the European Union (FEDER funds) under the TIC2000-1151-C07-03 and TIC2001-0995-C02-01 CICYT grants.)
Although the separation between data and meta-data is not a new idea [1, 15], this paper introduces a new version of DualFS which proves, for the first time, that this separation can significantly improve file-system performance without requiring several storage devices, unlike previous proposals.
The new version of DualFS has several important improvements with respect to the previous one [16]. First of all, DualFS has been ported to the 2.4.19 Linux kernel. Secondly, we have greatly changed the implementation and removed some important bottlenecks. And finally, we have added three new enhancements which take advantage of the special structure of DualFS and significantly increase its performance. These enhancements are: meta-data prefetching, on-line meta-data relocation, and directory affinity.
The resulting file system has been compared with Ext2 [4], Ext3 [27], XFS [22, 24], JFS [8], and ReiserFS [25] through an extensive set of benchmarks, and the experimental results achieved not only show that DualFS is the best file system in general, but also that it can remarkably boost the overall system performance in many cases.
The rest of this paper is organized as follows. Section 2 summarizes the main features of DualFS and introduces the three new enhancements. In Section 3, we describe our benchmarking experiments, whose results are shown in Section 4. We conclude in Section 5.
2 DualFS
In this section we summarize the main features of DualFS, and describe the three enhancements that we have added to the new version of DualFS analyzed in this paper.
2.1 Design and Implementation
The key idea of our new file system is to manage data and meta-data in completely different ways, giving special treatment to meta-data [16]. Meta-data blocks are located on the meta-data device, whereas data blocks are located on the data device (see Figure 1).
Note that merely separating meta-data blocks from data blocks is not a new idea. However, we will show that this separation, along with special meta-data management, can significantly improve performance without requiring extra hardware (previous proposals for separating data and meta-data, such as the multi-structured file system [15] and interposed request routing [1], use several storage devices). In our experiments, the data device and the meta-data device are two adjacent partitions on the same disk. This is equivalent to having a single partition with two areas: one for data blocks and another for meta-data blocks.
Data Device
Data blocks of regular files are on the data device, where they are treated as in many Unix file systems. The data device only has data blocks, and uses the concept of a group of data blocks (similar to the cylinder group concept) in order to organize them (see Figure 1). Grouping is performed on a per-directory basis, i.e., data blocks of regular files in the same directory are in the same group (or in nearby groups if the corresponding group is full).
From the file-system point of view, data blocks are not important for consistency, so they are not written synchronously and do not receive any special treatment, as meta-data does [8, 20, 24, 28]. However, they must be taken into account for security reasons. When a new data block is added to a file, DualFS writes it to disk before writing its related meta-data blocks. Missing out this requirement would not actually damage consistency, but it could potentially lead to a file containing a previous file's contents after crash recovery, which is a security risk.
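The data-before-meta-data ordering can be sketched as follows. This is a minimal illustration with invented names (RecordingDisk, OrderedWriter); it is not DualFS code.

```python
class RecordingDisk:
    """Minimal stand-in for a disk: remembers the write order."""
    def __init__(self):
        self.log = []

    def write(self, block):
        self.log.append(block)

class OrderedWriter:
    """Illustrative writer enforcing data-before-meta-data ordering."""
    def __init__(self, disk):
        self.disk = disk
        self.pending_data = []  # new data blocks not yet on disk

    def append_data_block(self, block):
        self.pending_data.append(block)

    def commit_metadata(self, meta_blocks):
        # Flush new data blocks first: if a crash happens between the
        # two steps, the meta-data still points at old (or no) blocks,
        # so a file can never expose a previous file's contents.
        for block in self.pending_data:
            self.disk.write(block)
        self.pending_data.clear()
        for block in meta_blocks:
            self.disk.write(block)

disk = RecordingDisk()
writer = OrderedWriter(disk)
writer.append_data_block("data-1")
writer.append_data_block("data-2")
writer.commit_metadata(["inode-1"])
# disk.log now holds the data blocks before the meta-data block.
```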
Meta-data Device
Meta-data blocks are on the meta-data device, which is organized as a log-structured file system (see Figure 1). It is important to clarify that by meta-data we mean all of these items: i-nodes, indirect blocks, directory "data" blocks, and symbolic-link "data" blocks. Obviously, bitmaps, superblock copies, etc., are also meta-data elements.
Our implementation of a log-structured file system for meta-data is based on the BSD-LFS one [19]. However, it is important to note that, unlike BSD-LFS, our log does not contain data blocks, only meta-data ones. Another difference is the IFile. Our IFile implementation has two additional elements which manage the data device: the data-block group descriptor table and the data-block group bitmap table. The former basically has a free data-block count and an i-node count per group, while the latter indicates which blocks are free and which are busy in every group.
DualFS writes dirty meta-data blocks to the log every five seconds and, like other journaling systems, uses the log after a system crash to recover file-system consistency quickly from the last checkpoint (checkpointing is performed every sixty seconds). Note, however, that one of the main differences between DualFS and current journaling file systems [5, 24, 28] is the fact that DualFS only has to write one copy of every meta-data block. This copy is written in the log, which is used by DualFS both to read and to write meta-data blocks. In this way, DualFS avoids a time-expensive extra copy of meta-data blocks. Furthermore, since meta-data blocks are written only once and in big chunks in the log, the average write-request size is greater in DualFS than in other file systems where meta-data blocks are spread out. This causes fewer write requests and, hence, greater write performance [16].
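The effect of batching log writes on the request count can be illustrated with simple arithmetic; the 64-block segment size below is an assumption chosen for illustration, not a DualFS parameter.

```python
def write_requests(dirty_blocks, blocks_per_request):
    """Number of write requests needed to flush `dirty_blocks`
    when each request can carry `blocks_per_request` blocks."""
    return -(-dirty_blocks // blocks_per_request)  # ceiling division

# Scattered meta-data writes: roughly one request per block.
spread = write_requests(1000, 1)
# Log-structured writes: many blocks per sequential request
# (e.g. a 64-block chunk; the chunk size is an assumption).
batched = write_requests(1000, 64)
# 1000 requests versus 16: fewer, larger requests for the same data.
```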
Finally, it is important to note that our file system is considered consistent when the information about meta-data is correct. Like other approaches (JFS [8], SoftUpdates [14], XFS [24], and VxFS [28]), DualFS allows some loss of data in the event of a system crash.
Cleaner
Since our meta-data device is a log-structured file system, we need a segment cleaner for it. Our cleaner is started in two cases: (a) every 5 seconds, if the number of clean segments drops below a specific threshold, and (b) when we need a new clean segment and all segments are dirty.
At the moment our attention is not on the cleaner, so we have implemented a simple one, based on Rosenblum's cleaner [18]. Despite its simplicity, this cleaner has a very small impact on DualFS performance [16].
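The two triggering conditions can be sketched as a simple predicate (a minimal sketch; the function and parameter names are illustrative, not DualFS internals):

```python
def should_start_cleaner(clean_segments, threshold, need_new_segment):
    """Decide whether the segment cleaner should run."""
    # (a) periodic check: too few clean segments left.
    if clean_segments < threshold:
        return True
    # (b) on demand: a clean segment is needed but all are dirty.
    if need_new_segment and clean_segments == 0:
        return True
    return False
```

In DualFS, case (a) is evaluated on the 5-second timer, while case (b) fires whenever a new segment is requested.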
2.2 Enhancements
[Figure 1: disk divided into the data device (groups 0 to N-1) and the meta-data device (segments 0 to K-1); writes are appended to the log.]
Figure 1: DualFS overview. Arrows are pointers stored in segments of the meta-data device. These pointers are numbers of blocks in both the data device (data blocks) and the meta-data device (meta-data blocks).

Reading a regular file in DualFS is inefficient because data blocks and their related meta-data blocks are a long way from each other, and many long seeks are required. In this section we present a solution to that problem, meta-data prefetching; a mechanism to improve that prefetching, on-line meta-data relocation; and a third mechanism to improve the overall performance of data operations, directory affinity.
Meta-data Prefetching
A solution to the read problem in DualFS is meta-data prefetching. If we are able to read many meta-data blocks of a file in a single disk I/O operation, then the disk heads will stay in the data zone for a long time. This will eliminate almost all the long seeks between the data device and the meta-data device, and it will produce bigger data read requests.
The meta-data prefetching implemented in DualFS is very simple: when a meta-data block is required, DualFS reads a group of consecutive meta-data blocks from disk, including the required meta-data block. Prefetching is not performed when the requested meta-data block is already in memory. The idea is not to force an unnecessary disk I/O request, but to take advantage of a compulsory disk-head movement to the meta-data zone when a meta-data block must be read, and to prefetch some meta-data blocks then. Since all prefetched meta-data blocks are consecutive, we also take advantage of the built-in cache of the disk drive.
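A minimal sketch of this prefetch policy follows; the window size and the cache/disk interfaces are assumptions made for illustration, not DualFS internals.

```python
PREFETCH_WINDOW = 8   # total consecutive blocks read per miss (assumed)

def read_metadata_block(blkno, cache, disk):
    """Return meta-data block `blkno`, prefetching neighbors on a miss."""
    if blkno in cache:              # already in memory: no prefetch
        return cache[blkno]
    # The disk head had to move to the meta-data zone anyway, so read
    # a run of consecutive blocks: mostly *before* the requested one
    # (the log is written in a known, mostly inverse, order), plus a
    # block just after it for files read slightly out of write order.
    start = max(0, blkno - (PREFETCH_WINDOW - 2))
    for b in range(start, blkno + 2):
        if b not in cache:
            cache[b] = disk.read(b)
    return cache[blkno]

class RecordingDisk:
    """Stand-in disk that records which blocks were read."""
    def __init__(self):
        self.reads = []

    def read(self, blkno):
        self.reads.append(blkno)
        return f"block-{blkno}"

cache, disk = {}, RecordingDisk()
read_metadata_block(10, cache, disk)   # miss: one run of 8 reads
read_metadata_block(8, cache, disk)    # hit: no disk access at all
```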
But, in order to be efficient, our simple prefetching requires some kind of meta-data locality. The DualFS meta-data device is a log where meta-data blocks are sequentially written in chunks called "partial segments". All meta-data blocks in a partial segment have been created or modified at the same time, so some kind of relationship between them is expected. Moreover, many relationships between meta-data blocks are due to those blocks belonging to the same file (e.g., indirect blocks). It is this kind of temporal and spatial meta-data locality, present in the DualFS log, that makes our prefetching highly efficient.

Once we have decided when to prefetch, the next step is to decide which and how many meta-data blocks must be prefetched. This decision depends on the meta-data layout in the DualFS log, and this layout, in turn, depends on the meta-data write order.
In DualFS, meta-data blocks belonging to the same file are written to the log in the following order: "data" blocks (of directories and symbolic links), single indirect blocks, double indirect blocks, and triple indirect blocks. Furthermore, "data" blocks are written in inverse logical order. Files are also written in inverse order: first, the last modified (or created) file, and last, the first modified (or created) file. Note that the important fact is not that the logical order is "inverse"; what is important is that the logical order is "known".
Taking all of the above into account, we can see that the greater part of the prefetched meta-data blocks must be just before the required meta-data block. However, since some files can be read in an order slightly different from the one in which they were written, we must also prefetch some meta-data blocks which are located after the requested meta-data block.
Finally, note that, unlike other prefetching methods [10, 11], DualFS prefetching is I/O-time efficient; that is, it does not cause extra I/O requests which could in turn cause long seeks. Also note that our simple prefetching mechanism works because it takes advantage of the unique features of our meta-data device.
On-line Meta-data Relocation
The meta-data prefetching efficiency may deteriorate for several reasons:

• files are read in an order which is very different from the order in which they were written;

• the read pattern can change over the course of time;

• file-system aging.
Inefficient prefetching increases the number of meta-data read requests and, hence, the number of disk-head movements between the data zone and the meta-data zone. Moreover, it can become counterproductive because it can fill the buffer cache up with useless meta-data blocks. In order to avoid prefetching degradation (and to improve its performance in some cases), we have implemented an on-line meta-data relocation mechanism in DualFS which increases temporal and spatial meta-data locality.
Meta-data relocation works as follows: when a meta-data element (i-node, indirect block, directory block, etc.) is read, it is written to the log like any other just-modified meta-data element. Note that it does not matter whether the meta-data element was already in memory when it was read or whether it was read from disk.
Meta-data relocation is mainly performed in two cases:

• when reading a regular file (its indirect blocks are relocated);

• when reading a directory (its "data" and indirect blocks are relocated).
This constant relocation adds more meta-data writes. However, meta-data writes are very efficient in DualFS because they are performed sequentially and in big chunks. Therefore, it is expected that this relocation, even when very aggressive, does not increase the total I/O time significantly.
Since a file is usually read in its entirety [2], meta-data relocation puts together all the meta-data blocks of the file. The next time the file is read, all its meta-data blocks will be together, and prefetching will be very efficient.
Meta-data relocation also puts together the meta-data blocks of different files read at the same time (i.e., the meta-data blocks of a directory and its regular files). If those files are later read at the same time again, and even in the same order, prefetching will work very well. This assumption about the read order is made by many prefetching techniques [10, 11]. It is important to note that our meta-data relocation exposes the relationships between files by writing together the meta-data blocks of files which are read at the same time. Unlike other prefetching techniques [10, 11], this relationship is permanently recorded on disk, and it can be exploited by our prefetching mechanism after a system restart.
This kind of on-line meta-data relocation is in a sense similar to the work proposed by Matthews et al. [12]. They take advantage of cached data in order to reduce cleaning costs; we write recently read and, hence, cached meta-data blocks to the log in order to improve meta-data locality and prefetching. They also propose a dynamic reorganization of data blocks to improve read performance. However, that reorganization is not an on-line one, and it is more complex than ours.
Finally, we must explain an implicit meta-data relocation which we have not mentioned yet. When a file is read, the access-time field in its i-node must be updated. If several files are read at the same time, their i-nodes will also be updated at the same time and written together in the log. In this way, when an i-node is read later, the other i-nodes in the same block will also be read. This implicit prefetching is very important since it exploits the temporal locality in the log, and can potentially reduce file-open latencies and the overall I/O time. Note that this i-node relocation always takes place, without explicit meta-data relocation, except when the file system is mounted read-only.
This implicit meta-data relocation can have an effect similar to that achieved by the embedded i-nodes proposed by Ganger and Kaashoek [6]. In their proposal, i-nodes of files in the same directory are put together and stored inside the directory itself. Note, however, that they exploit spatial locality, whereas we exploit both spatial and temporal locality.
Directory Affinity
In DualFS, as in Ext2 and Ext3, when a new directory is created it is assigned a data-block group on the data device. Data blocks of the regular files created in that directory are put in the group assigned to the directory. DualFS specifically selects the emptiest data-block group for the new directory.
Note that, when creating a directory tree, DualFS has to select a new group every time a directory is created. This can cause a change of group and, hence, a large seek. However, the new group for the new directory may be only slightly better than that of the parent directory. A better option is to remain in the parent directory's group, and to change only when the new group is good enough according to a specific threshold (for example, we can change to the new group if it has a given percentage more free data blocks than the parent directory's group). The latter is called directory affinity in DualFS, and can greatly improve DualFS performance, even for a threshold as small as 10%.
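The group selection with directory affinity can be sketched as a simple threshold test. The function name and interfaces below are illustrative, with the 10% threshold taken from the text.

```python
def pick_group(parent_group, free_blocks, threshold=0.10):
    """Choose the data-block group for a new directory.

    free_blocks maps group id -> free data-block count.
    Stay in the parent's group unless the emptiest group has at
    least `threshold` more free blocks (avoids needless long seeks).
    """
    emptiest = max(free_blocks, key=free_blocks.get)
    if free_blocks[emptiest] > free_blocks[parent_group] * (1 + threshold):
        return emptiest
    return parent_group

# Only slightly better group: stay with the parent.
same = pick_group(0, {0: 100, 1: 105})
# Much better group: worth moving.
moved = pick_group(0, {0: 100, 1: 120})
```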
Directory affinity in DualFS is somewhat similar to the "directory affinity" concept used by Anderson et al. [1], but with some important differences. For example, their directory affinity is a probability (the probability that a new directory is placed on the same server as its parent), whereas it is a threshold in our case.
3 Experimental Methodology
This section describes how we have evaluated the new version of DualFS. We have used both microbenchmarks and macrobenchmarks with different configurations, using Linux kernel 2.4.19. We have compared DualFS against Ext2 [4], the default file system in Linux, and four journaling file systems: Ext3 [27], XFS [22, 24], JFS [8], and ReiserFS [25].
Ext2 is not a journaling file system, but it has been included because it is a widely-used and well-understood file system. Ext2 is an FFS-like file system [13] which does not write modified meta-data elements synchronously. Instead, it marks a meta-data element to be written within 5 seconds. Furthermore, meta-data elements involved in a file-system operation are modified, and marked to be written, in a specific order (although this order is not compulsory, and depends on the disk scheduler). In this way, Ext2 can almost always recover its consistency after a system crash without significantly hurting performance.
Ext3 is a journaling file system derived from Ext2 which provides different consistency levels through mount options. In our tests, the mount option used has been "-o data=ordered", which gives Ext3 a behavior similar to that of DualFS. With this option, Ext3 only journals meta-data changes, but flushes data updates to disk before any transactions commit.
Based on SGI's IRIX XFS file-system technology, XFS is a journaling file system which supports meta-data journaling. XFS uses allocation groups and extent-based allocation to improve the locality of data on disk. This results in improved performance, particularly for large sequential transfers. Performance features include asynchronous write-ahead logging (similar to Ext3 with data=writeback), balanced binary trees for most file-system meta-data, delayed allocation, and dynamic disk i-node allocation.
IBM's JFS originated on AIX, and from there was ported to Linux. JFS is also a journaling file system which supports meta-data logging. Its technical features include extent-based storage allocation, dynamic disk i-node allocation, asynchronous write-ahead logging, and sparse and dense file support.
ReiserFS is a journaling file system which is especially intended to improve the performance of small files, to use disk space more efficiently, and to speed up operations on directories with thousands of files. Like other journaling file systems, ReiserFS only journals meta-data. In our tests, the mount option used has been “-o notail”, which notably improves ReiserFS performance at the expense of disk-space efficiency.
Bryant et al. [3] compared the above five file systems using several benchmarks on three different systems, ranging in size from a single-user workstation to a 28-processor ccNUMA machine. However, there are some important differences between their work and ours. Firstly, we evaluate a next-generation journaling file system (DualFS). Secondly, we report results for some industry-standard benchmarks (SpecWeb99 and TPC-C). Thirdly, we use microbenchmarks which are able to clearly expose performance differences between file systems (in their single-user workstation system, for example, only one benchmark was able to show performance differences). And finally, we also report I/O time at disk level (this time is very important because it reflects the behavior of every file system as seen by the disk drive).
3.1 Microbenchmarks
Microbenchmarks are intended to discover strengths and weaknesses of every file system. We have designed seven benchmarks:
read-meta (r-m) find files larger than 2 KB in a directory tree.

read-data-meta (r-dm) read all the regular files in a directory tree.

write-meta (w-m) untar an archive which contains a directory tree with empty files.

write-data-meta (w-dm) the same as w-m, but with non-empty files.

read-write-meta (rw-m) copy a directory tree with empty files.

read-write-data-meta (rw-dm) the same as rw-m, but with non-empty files.
delete (del) delete a directory tree.
In all cases, the directory tree has 4 subdirectories, each with a copy of a clean Linux 2.4.19 source tree. In the “write-meta” and “read-write-meta” benchmarks, we have truncated to zero all regular files in the directory tree. In the “write-meta” and “write-data-meta” tests, the untarred archive is in a file system which is not being analyzed.
All tests have been run 5 times for 1 and 4 processes. When there are 4 processes, each works on its own Linux source tree copy.
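The first two microbenchmarks can be driven by small scripts; the sketch below shows one plausible implementation of r-m and r-dm in Python (the function names are ours, not part of the benchmark suite):

```python
import os

def read_meta(root):
    """r-m: walk the tree and list files larger than 2 KB. Only directory
    entries and i-node sizes are consulted (meta-data reads), file
    contents are never opened."""
    large = []
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > 2 * 1024:
                large.append(path)
    return large

def read_data_meta(root):
    """r-dm: read every regular file in the tree (data and meta-data
    reads). Returns the total number of bytes read."""
    total = 0
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as f:
                total += len(f.read())
    return total
```

The remaining benchmarks (untar, copy, delete) follow the same pattern with `tarfile`, `shutil.copytree`, and `shutil.rmtree`.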
3.2 Macrobenchmarks
Next, we list the benchmarks we have performed to study the viability of our proposal. Note that we have chosen environments that are currently representative.
Kernel Compilation for 1 Process (KC-1P) resolve dependencies (make dep) and compile the Linux kernel 2.4.19, given a common configuration. The kernel and modules compilation phases are run by a single process (make bzImage, and make modules).
Kernel Compilation for 4 Processes (KC-4P) the same as before, but with 4 processes (make -j4 bzImage, and make -j4 modules).
SpecWeb99 (SW99) the SPECweb99 benchmark [23]. We have used two machines: a server, with the file system to be analyzed, and a client. The network is a Fast Ethernet LAN.
PostMark (PM) the PostMark benchmark, which was designed by Jeffrey Katcher to model the workload seen by Internet Service Providers under heavy load [9]. We have run our experiments using version 1.5 of the benchmark. In our case, the benchmark initially creates 150,000 files with a size range of 512 bytes to 16 KB, spread across 150 subdirectories. Then, it performs 20,000 transactions with no bias toward any particular transaction type, and with a transaction block of 512 bytes.
TPCC-UVA (TPC-C) an implementation of the TPC-C benchmark [26]. Due to system limitations, we have only used 3 warehouses. The benchmark is run with an initial 30-minute warm-up stage, and a subsequent measurement time of 2 hours.
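To make the PostMark workload concrete, the following is a scaled-down Python sketch of its two phases as described above (this is our own toy reimplementation, not the actual PostMark tool; the defaults shrink the paper's 150,000 files / 150 subdirectories / 20,000 transactions):

```python
import os, random

def postmark_like(root, n_files=150, n_subdirs=5, n_txns=200,
                  min_size=512, max_size=16 * 1024, block=512, seed=42):
    """Phase 1: create files of 512 B..16 KB spread across subdirectories.
    Phase 2: run transactions (create/delete/read/append) chosen with no
    bias toward any type, using a 512-byte transaction block."""
    rng = random.Random(seed)
    files = []
    next_id = 0

    def create():
        nonlocal next_id
        sub = os.path.join(root, "s%02d" % (next_id % n_subdirs))
        os.makedirs(sub, exist_ok=True)
        path = os.path.join(sub, "f%06d" % next_id)  # unique name
        next_id += 1
        with open(path, "wb") as f:
            f.write(b"\0" * rng.randint(min_size, max_size))
        files.append(path)

    for _ in range(n_files):
        create()
    for _ in range(n_txns):
        op = rng.choice(("create", "delete", "read", "append"))
        if op == "create":
            create()
        elif op == "delete" and len(files) > 1:
            os.remove(files.pop(rng.randrange(len(files))))
        elif op == "read":
            with open(rng.choice(files), "rb") as f:
                f.read(block)
        elif op == "append":
            with open(rng.choice(files), "ab") as f:
                f.write(b"\0" * block)
    return len(files)
```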
3.3 Tested Configurations
All benchmarks have been run for the six configurations shown below. Mount options have been selected following the recommendations of Bryant et al. [3]:
Ext2 Ext2 without any special mount option.
Ext3 Ext3 with “-o data=ordered” mount option.
XFS XFS with “-o logbufs=8,osyncisdsync” mount options.
JFS JFS without any special mount option.
ReiserFS ReiserFS with “-o notail” mount option.
Linux Platform
  Processor  Two 450 MHz Pentium III
  Memory     256 MB, PC100 SDRAM
  Disks      One 4 GB IDE 5,400 RPM Seagate ST-34310A (test disk).
             One 4 GB SCSI 10,000 RPM Fujitsu MAC3045SC
             (operating system, swap, and trace log).
  OS         Linux 2.4.19

Table 1: System Under Test
DualFS DualFS with meta-data prefetching, on-line meta-data relocation, and directory affinity.
Versions of Ext2, Ext3, and ReiserFS are those found in a standard Linux kernel 2.4.19. The XFS version is 1.1, and the JFS version is 1.1.1.
All but DualFS are on one IDE disk with only one partition, and their logical block size is 4 KB. DualFS is also on one IDE disk, but with two adjacent partitions. The meta-data device is always on the outer partition, given that this partition is faster than the inner one [30], and its size is 10% of the total disk space. The inner partition is the data device. Both the data and the meta-data devices also have a logical block size of 4 KB.
Finally, DualFS has a prefetching size of 16 blocks, and a directory affinity of 10% (that is, a newly created directory is assigned the best group, instead of its parent directory’s group, if the best group has at least 10% more free data blocks). Since the logical block size is 4 KB, the prefetching size is equal to 64 KB, precisely the biggest disk I/O transfer size used by default by Linux.
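The directory-affinity rule above can be stated as a small decision function; the sketch below is ours (the group-bookkeeping names are invented, not DualFS internals):

```python
def choose_group(parent_group, best_group, free_data_blocks, affinity=0.10):
    """Place a newly created directory in the best group, instead of its
    parent directory's group, only if the best group has at least 10%
    more free data blocks. `free_data_blocks` maps group id -> free
    data blocks (hypothetical bookkeeping structure)."""
    if free_data_blocks[best_group] >= (1 + affinity) * free_data_blocks[parent_group]:
        return best_group
    return parent_group

# Prefetching size: 16 blocks of 4 KB each, i.e. 64 KB, Linux's biggest
# default disk I/O transfer size.
PREFETCH_BYTES = 16 * 4 * 1024
assert PREFETCH_BYTES == 64 * 1024
```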
3.4 System Under Test
All tests are done on the same machine. The configuration is shown in Table 1.
In order to trace disk I/O activity, we have instrumented the operating system (Linux 2.4.19) to record when a request starts and finishes. The messages generated by our trace system are logged on an SCSI disk which is not used for evaluation purposes.
Messages are printed using the kernel function printk. This function writes messages into a circular buffer in main memory, so the delay introduced by our trace system is small (around 1%), especially when compared to the time required to perform a disk access. Later, these messages are read through the /proc/kmsg interface, and then written to the SCSI disk in big chunks. In order to avoid the loss of messages (the last messages can overwrite the older ones), we have increased the circular buffer size from 16 KB to 1 MB, and we have given maximum priority to all processes involved in the trace system.
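The overwrite behavior of a circular message buffer, and why enlarging it prevents message loss, can be modeled in a few lines (a toy model of the mechanism, not the kernel's printk implementation):

```python
from collections import deque

class RingLog:
    """Toy circular message buffer: when the buffer is full, each new
    message overwrites the oldest one. This is exactly the message loss
    that a larger buffer (16 KB -> 1 MB in the text) avoids."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
        self.dropped = 0

    def log(self, msg):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1  # the oldest message is about to be lost
        self.buf.append(msg)
```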
4 Experimental Results
In order to better understand the different file systems, we show two performance results for each file system in each benchmark:
Disk I/O time: the total time taken by all disk I/O operations.

Performance: the performance achieved by the file system in the benchmark.
The performance result is the application time, except for the SpecWeb99 and TPC-C benchmarks. Since these macrobenchmarks take a fixed time specified by the benchmark itself, we will use the benchmark metrics (number of simultaneous connections for SpecWeb99, and tpmC for TPC-C) as throughput measurements.
We could give the application time only, but given that some macrobenchmarks (e.g., compilation) are CPU-bound in our system, we find I/O time more useful in those cases. The total I/O time can give us an idea of to what extent the storage system can be loaded. A file system that loads the disk less than other file systems makes it possible to increase the number of applications which perform disk operations concurrently.
Figures below show the confidence intervals for the mean as error bars, for a 95% confidence level calculated using a Student’s-t statistic. The number inside each bar is the height of the bar, which is a value normalized with respect to Ext2. For Ext2 the height is always 1, so we have written the absolute value for Ext2 inside each Ext2 bar, which is useful for comparison purposes between figures.
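The error bars can be reproduced with a standard Student's-t interval. For 5 runs (4 degrees of freedom), the two-sided 95% critical value is about 2.776; the sketch below hardcodes a small t-table rather than pulling in SciPy:

```python
import math

# Two-sided 95% Student's-t critical values, indexed by degrees of freedom.
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def confidence_interval(samples, t_table=T95):
    """Return (mean, half_width) of a 95% confidence interval:
    half_width = t * s / sqrt(n), with s the sample standard deviation."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    half = t_table[n - 1] * math.sqrt(var / n)
    return mean, half
```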
The DualFS version evaluated in this section includes the three enhancements described previously. The reader can refer to our previous work [17] for the impact of each enhancement (except directory affinity) on DualFS performance.
Finally, it is important to note that all benchmarks have been run with a cold file system cache (the computer is restarted after every test run).
4.1 Microbenchmarks
Figures 2 and 3 show the microbenchmark results. A quick review shows that DualFS has the lowest disk I/O times and application times in general, in both write and read operations, and that JFS has the worst performance. Only ReiserFS is clearly better than DualFS in the write-data-meta benchmark, and it is even better when there are four processes. However, ReiserFS performance is very poor in the read-data-meta case. This is a serious problem for ReiserFS given that reads are a more frequent operation than writes [2, 29]. In order to understand these results, we must explain some ReiserFS features.
ReiserFS does not use the block group or cylinder group concepts like the other file systems (Ext2, Ext3, XFS, JFS, and DualFS). Instead, ReiserFS uses an allocation algorithm which allocates blocks almost sequentially when the file system is empty. This allocation starts at the beginning of the file system, after the last block of the meta-data log. Since many blocks are allocated sequentially, they are also written sequentially. The other file systems, however, have data blocks spread across different groups which take up the entire file system, so writes are not as efficient as in ReiserFS. This explains the good performance of ReiserFS in the write-data-meta test.
The above also explains why ReiserFS is not so bad in the read-data-meta test when there are four processes. Since the directory tree read by the processes is small, and it is created in an empty file system, all its blocks are together at the beginning of the file system. This makes the processes’ read requests cause only small seeks when the processes read the directory tree. The same does not occur in the other file systems, where the four processes cause long seeks because they read blocks which are in different groups spread across the entire disk.
Another interesting point in the write-data-meta benchmark is the performance of ReiserFS and DualFS with respect to Ext2. Although the disk I/O times of both file systems are much better than that of Ext2, the application times are not as good. This indicates that Ext2 achieves a better overlap between I/O and CPU operations because it neither has a journal nor forces a meta-data write order.
The good performance of DualFS in the read-data-meta benchmark is especially remarkable: DualFS is up to 35% faster than ReiserFS. When compared with previous versions of DualFS [16], we can see that this performance is mainly provided by the meta-data prefetching and directory affinity mechanisms, which respectively reduce the number of long seeks between the data and meta-data partitions, and between data-block groups within the data partition.
In all the metadata-only benchmarks but write-meta, the distinguished winners (regarding the application time) are ReiserFS and DualFS. They are also the absolute winners when taking the disk I/O time into account. Also note that DualFS is the best when the number of processes is four.
[Figure 2: Microbenchmark results (application time). Two panels (1 process and 4 processes) plot the application time of Ext2, Ext3, XFS, JFS, ReiserFS, and DualFS for the w-dm, r-dm, rw-dm, w-m, r-m, rw-m, and del benchmarks, normalized with respect to Ext2; the absolute Ext2 times are printed inside the Ext2 bars.]
The problem with the write-meta benchmark is that it is CPU-bound, because all modified meta-data blocks fit in main memory. In this benchmark, Ext2 is very efficient whereas the journaling file systems are, in general, big consumers of CPU time. Even then, DualFS is the best of the five. Regarding the disk I/O time, ReiserFS and DualFS are the best, as expected, because they write meta-data blocks to disk sequentially.
[Figure 3: Microbenchmark results (disk I/O time). Two panels (1 process and 4 processes) plot the disk I/O time of the six file systems for the same seven benchmarks, normalized with respect to Ext2; the absolute Ext2 times are printed inside the Ext2 bars.]
In the read-meta case, ReiserFS and DualFS have the best performance because they read meta-data blocks which are very close together. Ext2, Ext3, XFS, and JFS, however, have meta-data blocks spread across the storage device, which causes long disk-head movements. Note that the DualFS performance is even better when there are four processes.
This increase in DualFS performance, when the number of processes goes from one to four, is due to the meta-data prefetching. Indeed, prefetching makes DualFS scalable with the number of processes. Table 2 shows the I/O time for the six file systems studied, for one and four processes. Whereas Ext2, Ext3, XFS, JFS, and ReiserFS significantly increase their total I/O time when the number of processes goes from one to four, DualFS increases its I/O time only slightly.

                        I/O Time (seconds)
File System    1 process        4 processes      Increase (%)
Ext2           47.86 (0.35)     56.99 (1.13)      19.08
Ext3           48.45 (0.29)     57.59 (0.35)      18.86
XFS            24.00 (0.66)     35.13 (1.15)      46.38
JFS            45.49 (0.46)     98.10 (0.44)     115.65
ReiserFS        7.02 (1.44)      9.77 (2.32)      39.17
DualFS          5.46 (2.00)      5.90 (3.33)       8.06

Table 2: Results of the r-m test. The value in parenthesis is the confidence interval given as a percentage of the mean.
For one process, the high meta-data locality in the DualFS log and the implicit prefetching performed by the disk drive (through the read-ahead mechanism) make the difference between DualFS, and Ext2, Ext3, JFS, and XFS. ReiserFS also takes advantage of the same disk read-ahead mechanism. However, that implicit prefetching performed by the disk drive is less effective if the number of processes is two or greater. When there are four processes, the disk heads constantly go from track to track because each process works with a different area of the meta-data device. When the disk drive reads a new track, the previously read track is evicted from the built-in disk cache, and its meta-data blocks are discarded before being requested by the process which caused the read of the track.
The explicit prefetching performed by DualFS solves the above problem by copying meta-data blocks from the built-in disk cache to the buffer cache in main memory before they are evicted. Meta-data blocks can stay in the buffer cache for a long time, whereas meta-data blocks in the built-in cache will soon be evicted, when the disk drive reads another track.
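The effect described in the last two paragraphs can be illustrated with a toy model (all parameters invented): a built-in disk cache that holds only the most recently read track, four readers interleaving requests over disjoint regions, with and without explicit multi-block prefetching into a large main-memory buffer cache:

```python
def track_reads(n_procs=4, blocks_per_proc=32, track=8, prefetch=False):
    """Count track reads when n_procs sequential readers interleave one
    request at a time. The built-in cache holds only the last track read;
    with `prefetch`, every block of a freshly read track is copied into
    the (large) buffer cache before it can be evicted."""
    cached_track = None   # built-in disk cache: most recent track only
    buffer_cache = set()  # main-memory buffer cache
    reads = 0
    for step in range(blocks_per_proc):
        for p in range(n_procs):
            block = p * 1000 + step  # each process reads its own region
            if block in buffer_cache:
                continue             # explicit prefetch already saved it
            t = block // track       # track holding this block
            if t != cached_track:
                cached_track = t     # the drive reads the whole track
                reads += 1
                if prefetch:
                    buffer_cache.update(range(t * track, (t + 1) * track))
    return reads
```

With one process, the drive's implicit read-ahead suffices (4 track reads for 32 blocks); with four interleaved processes, every request evicts the track the next process needs (128 track reads), unless explicit prefetching copies each track into the buffer cache first (16 track reads).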
Another remarkable point in the read-meta benchmark is the XFS performance. Although XFS has meta-data blocks spread across the storage device like Ext2 and Ext3, its performance is much better. We have analyzed the XFS disk I/O traces and found out that XFS does not update the “atime” of directories by default (we have not specified any “noatime” or “nodiratime” mount options for any file system). The lack of meta-data writes in XFS reduces the total I/O time because there are fewer disk operations, and because the average time of the read requests is smaller. JFS does not update the “atime” of directories either, but that does not significantly reduce its I/O time.
In the last benchmark, del, the XFS behavior is odd again. For one process, it has very bad performance. However, the performance improves when there are four processes. The other file systems behave as we expect.
Finally, note the great performance of DualFS in the read-data-meta and read-meta benchmarks despite the on-line meta-data relocation.
4.2 Macrobenchmarks
Results of the macrobenchmarks are shown in Figure 4. Since the benchmark metrics are different, we show the application performance of every file system relative to Ext2.
As we can see, the only I/O-bound benchmark is PostMark (that is, benchmark results agree with the I/O time results). The other four benchmarks are CPU-bound in our system, and all file systems consequently have similar performance. Nevertheless, DualFS is usually a little better than the other file systems. The reason to include these CPU-bound benchmarks is that they are very common in system evaluations.
From the disk I/O time point of view, DualFS has the best throughput. Only XFS is better than DualFS in the SpecWeb99 and TPC-C benchmarks. However, we must take into account that XFS does not update the access time of directory i-nodes, and that it does not meet the TPC-C benchmark requirements (specifically, it does not meet the response time constraints for the new-order and order-status transactions).
In order to explain this superiority of DualFS over the other file systems, we have analyzed the disk I/O traces obtained, and we have found out that the performance differences between file systems are mainly due to writes. There are a lot of write operations in these tests, and DualFS is the file system which carries them out best. For JFS, however, these performance differences are also due to reads, which take longer than in the other file systems.
Internet Service Providers should pay special attention to the results achieved by DualFS in the PostMark benchmark. In this test, DualFS achieves 60% more transactions per second than Ext2 and Ext3, twice as many transactions per second as ReiserFS, almost three times as many transactions per second as XFS, and four times as many transactions per second as JFS. Although there are specific file systems for Internet workloads [21], note that these results are achieved by a general-purpose file system, DualFS, which is intended for a variety of workloads.
[Figure 4: Macrobenchmark results. The first figure shows the application throughput improvement achieved by each file system with respect to Ext2 for the KC-1P, KC-4P, PM, SW99, and TPC-C benchmarks; the second figure shows the disk I/O time normalized with respect to Ext2, with the absolute Ext2 times printed inside the Ext2 bars. In the TPC-C benchmark, the Ext3 and XFS bars are striped because they do not meet the TPC-C benchmark requirements.]
5 Conclusions
Improving file system performance is important for a wide variety of systems, from general-purpose systems to more specialized high-performance systems. Many high-performance systems, for example, rely on off-the-shelf file systems to store final and partial results, and to resume failed computations.
In this paper we have introduced a new version of DualFS which proves, for the first time, that a new journaling file system design based on data and meta-data separation, and on special meta-data management, is not only possible but also desirable.
Through an extensive set of micro- and macrobenchmarks, we have evaluated six different journaling and non-journaling file systems (Ext2, Ext3, XFS, JFS, ReiserFS, and DualFS). The experimental results obtained have shown not only that DualFS has the lowest I/O time in general, but also that it can significantly boost overall file system performance for many workloads (up to 98%).
We feel, however, that the new design of DualFS allows further improvements which can increase DualFS performance even more. Currently, we are working on extracting read access pattern information from meta-data blocks written by the relocation mechanism in order to prefetch entire regular files.
Availability
For more information about DualFS, please visit the Web site at: http://www.ditec.um.es/~piernas/dualfs.
References

[1] ANDERSON, D. C., CHASE, J. S., AND VAHDAT, A. M. Interposed request routing for scalable network storage. In Proc. of the Fourth USENIX Symposium on Operating Systems Design and Implementation (OSDI) (Oct. 2000), 259–272.

[2] BAKER, M. G., HARTMAN, J. H., KUPFER, M. D., SHIRRIFF, K. W., AND OUSTERHOUT, J. K. Measurements of a distributed file system. In Proc. of the 13th ACM Symposium on Operating Systems Principles (1991), 198–212.

[3] BRYANT, R., FORESTER, R., AND HAWKES, J. Filesystem performance and scalability in Linux 2.4.17. In Proc. of the FREENIX Track: 2002 USENIX Annual Technical Conference (June 2002), 259–274.

[4] CARD, R., TS’O, T., AND TWEEDIE, S. Design and implementation of the second extended filesystem. In Proc. of the First Dutch International Symposium on Linux (Dec. 1994).

[5] CHUTANI, S., ANDERSON, O. T., KAZAR, M. L., LEVERETT, B. W., MASON, W. A., AND SIDEBOTHAM, R. The Episode file system. In Proc. of the Winter 1992 USENIX Conference: San Francisco, California, USA (Jan. 1992), 43–60.

[6] GANGER, G. R., AND KAASHOEK, M. F. Embedded inodes and explicit grouping: Exploiting disk bandwidth for small files. In Proc. of the USENIX Annual Technical Conference, Anaheim, California, USA (Jan. 1997), 1–17.

[7] HAGMANN, R. Reimplementing the Cedar file system using logging and group commit. In Proc. of the 11th ACM Symposium on Operating Systems Principles (Nov. 1987), 155–162.

[8] JFS for Linux. http://oss.software.ibm.com/jfs, 2003.

[9] KATCHER, J. PostMark: A new file system benchmark. Technical Report TR3022, Network Appliance Inc. (Oct. 1997).

[10] KROEGER, T. M., AND LONG, D. D. E. Design and implementation of a predictive file prefetching algorithm. In Proc. of the 2001 USENIX Annual Technical Conference: Boston, Massachusetts, USA (June 2001), 319–328.

[11] LEI, H., AND DUCHAMP, D. An analytical approach to file prefetching. In Proc. of the 1997 USENIX Annual Technical Conference (1997), 275–288.

[12] MATTHEWS, J. N., ROSELLI, D., COSTELLO, A. M., WANG, R. Y., AND ANDERSON, T. E. Improving the performance of log-structured file systems with adaptive methods. In Proc. of the ACM SOSP Conference (Oct. 1997), 238–251.

[13] MCKUSICK, M., JOY, W., LEFFLER, S., AND FABRY, R. A fast file system for UNIX. ACM Transactions on Computer Systems 2, 3 (Aug. 1984), 181–197.

[14] MCKUSICK, M. K., AND GANGER, G. R. Soft updates: A technique for eliminating most synchronous writes in the fast filesystem. In Proc. of the 1999 USENIX Annual Technical Conference: Monterey, California, USA (June 1999), 1–17.

[15] MULLER, K., AND PASQUALE, J. A high performance multi-structured file system design. In Proc. of the 13th ACM Symposium on Operating Systems Principles (Oct. 1991), 56–67.

[16] PIERNAS, J., CORTES, T., AND GARCIA, J. M. DualFS: A new journaling file system without meta-data duplication. In Proc. of the 16th Annual ACM International Conference on Supercomputing (June 2002), 137–146.

[17] PIERNAS, J., CORTES, T., AND GARCIA, J. M. Improving the performance of new-generation journaling file systems through meta-data prefetching and on-line relocation. Technical Report UM-DITEC-2003-5 (Sept. 2002).

[18] ROSENBLUM, M., AND OUSTERHOUT, J. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10, 1 (Feb. 1992), 26–52.

[19] SELTZER, M., BOSTIC, K., MCKUSICK, M. K., AND STAELIN, C. An implementation of a log-structured file system for UNIX. In Proc. of the Winter 1993 USENIX Conference: San Diego, California, USA (Jan. 1993), 307–326.

[20] SELTZER, M. I., GANGER, G. R., MCKUSICK, M. K., SMITH, K. A., SOULES, C. A. N., AND STEIN, C. A. Journaling versus soft updates: Asynchronous meta-data protection in file systems. In Proc. of the 2000 USENIX Annual Technical Conference: San Diego, California, USA (June 2000).

[21] SHRIVER, E., GABBER, E., HUANG, L., AND STEIN, C. Storage management for web proxies. In Proc. of the 2001 USENIX Annual Technical Conference, Boston, MA (June 2001), 203–216.

[22] Linux XFS. http://linux-xfs.sgi.com/projects/xfs, 2003.

[23] SPECweb99 Benchmark. http://www.spec.org, 1999.

[24] SWEENEY, A., DOUCETTE, D., HU, W., ANDERSON, C., NISHIMOTO, M., AND PECK, G. Scalability in the XFS file system. In Proc. of the USENIX 1996 Annual Technical Conference: San Diego, California, USA (Jan. 1996), 1–14.

[25] ReiserFS. http://www.namesys.com, 2003.

[26] TPCC-UVA. http://www.infor.uva.es/~diego/tpcc.html, 2003.

[27] TWEEDIE, S. Journaling the Linux ext2fs filesystem. In LinuxExpo ’98 (1998).

[28] Veritas Software. The VERITAS File System (VxFS). http://www.veritas.com/products, 1995.

[29] VOGELS, W. File system usage in Windows NT 4.0. ACM SIGOPS Operating Systems Review 34, 5 (Dec. 1999), 93–109.

[30] WANG, J., AND HU, Y. PROFS: Performance-oriented data reorganization for log-structured file systems on multi-zone disks. In Proc. of the Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Cincinnati, OH, USA (Aug. 2001), 285–293.
Super-Fast Active Memory Operations-Based Synchronization
Lixin Zhang, Zhen Fang, John B. Carter, and Michael A. Parker
School of Computing
University of Utah
Salt Lake City, UT 84112
lizhang, zfang, retrac, [email protected]
Abstract
Synchronization is a crucial operation in scalable multiprocessors and a fundamental element in parallel programming. Various synchronization mechanisms have been proposed over the years and many of them have enjoyed great success. However, as network latency rapidly approaches thousands of processor cycles and multiprocessor systems become larger and larger, conventional synchronization techniques are failing to keep up with the increasing demand for fast, efficient synchronization.
In this paper, we propose a mechanism that allows atomic operations on synchronization variables to be executed on the home memory controller, and lets the home memory controller send fine-grained updates to waiting processors when a synchronization variable reaches certain values. Performing atomic operations near where the data resides, rather than moving the data across the network, operating on it, and moving it back, eliminates significant network traffic, hides high remote memory latencies, and introduces opportunities for additional parallelism. Using a fine-grained update protocol, rather than a write-invalidate protocol, on synchronization variables simplifies the programming model, eliminates invalidation requests, and reduces network traffic.
As a first demonstration, we have used a memory-controller-based atomic operation to optimize the barrier function of an OpenMP library. Our simulation results show that the optimized version outperforms a conventional LL/SC (load-linked/store-conditional) version by a factor of 20.8, a conventional processor-based atomic instruction version by a factor of 15.5, and an active messages version by a factor of 13.4.
Keywords: distributed shared-memory, synchronization, barrier, intelligent memory controller, coherence protocol
1 Introduction
Processor speeds are increasing while wall-time network latency is decreasing very little due to speed-of-light effects. The round-trip network latency of large-scale distributed shared-memory (DSM) machines will soon be thousands of processor cycles. As a result, synchronization overhead has become one of the major obstacles to the scalability of DSM systems and has limited their sustained performance. For instance, a barrier, one of the most frequently used synchronization operations, coordinates concurrent processes to guarantee that no process advances beyond a particular point until all processes have reached that point. Performing a barrier operation on an Origin 3000 system with 32 MIPS R14K processors requires about 90,000 cycles, during which time the system could execute 5.76 million floating-point operations. The 5.76 MFLOPs/barrier ratio is an alarming sign that conventional barrier implementations are hurting overall system performance.
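The 5.76 million figure is consistent with each of the 32 processors issuing two floating-point operations per cycle for the 90,000 cycles the barrier takes (the two-per-cycle peak rate of the R14K is our assumption; the text states only the product):

```python
processors = 32
barrier_cycles = 90_000
flops_per_cycle = 2  # assumed peak FP issue rate of a MIPS R14K core

lost_flops = processors * barrier_cycles * flops_per_cycle
assert lost_flops == 5_760_000  # the 5.76 million operations cited above
```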
A number of barrier solutions have been proposed over the decades. The fastest of all approaches is to use a pair of dedicated wires between every two nodes [2, 11, 22, 23]. However, such approaches are only possible for small-scale systems, where the number of dedicated wires is small. For medium-scale or large-scale systems, the cost of having dedicated wires between every two nodes is prohibitive. For example, a 128-node system would require 16,256 wires. In addition to the high cost of physical wires, hardware-wired approaches cannot synchronize more than one barrier at a time and do not interact well with load-balancing techniques, such as process migration, where the process-to-processor mapping is not static.
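The wire count follows directly from one pair of dedicated wires per node pair:

```python
n = 128
node_pairs = n * (n - 1) // 2  # 8,128 distinct node pairs
wires = 2 * node_pairs         # a pair of dedicated wires for each

assert wires == 16_256  # matches the figure cited above
```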
Most barrier solutions are software-based with limited hardware support. Some processors support atomic instructions that perform variants of read-modify-write operations, e.g., the semaphore instructions of IA-64 [7]. Performing atomic semaphore instructions in the processor core complicates the design of the processor pipeline, load/store units, and system bus. As a simpler alternative, most modern processors, including Alpha [1], PowerPC [13], and MIPS [8], rely on load-linked/store-conditional (LL/SC) instructions to implement atomic read-modify-write operations. An LL instruction loads a block of data into the cache. A subsequent SC instruction attempts to write to the same block. It succeeds only if the block has not been evicted from the local cache since the preceding LL, which implies that the read-write pair effectively was atomic. Any intervention request from another processor between the LL and SC instructions causes the SC to fail. To implement atomic synchronization, library routines typically retry the LL/SC pair repeatedly until the SC succeeds.
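The retry idiom described above can be sketched against a toy memory model (the `ToyMemory` class and its reservation flag are our invention for illustration; real LL/SC is a hardware primitive operating on cache lines):

```python
class ToyMemory:
    """Toy model of one memory word with an LL/SC reservation."""
    def __init__(self, value=0):
        self.value = value
        self.reserved = False

    def load_linked(self):
        self.reserved = True  # start watching the word
        return self.value

    def store_conditional(self, new):
        if not self.reserved:  # reservation lost: the SC fails
            return False
        self.value = new
        self.reserved = False
        return True

    def intervention(self):
        """Another processor touched the block: clear the reservation."""
        self.reserved = False

def atomic_increment(mem):
    """Retry the LL/SC pair until the SC succeeds, as library routines do."""
    while True:
        old = mem.load_linked()
        if mem.store_conditional(old + 1):
            return old
```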
The main drawback of atomic and LL/SC instructions is that they ship data back and forth across the network and system bus for every atomic operation. For instance, in the predominant scalable DSM architecture, directory-based CC-NUMA (cache-coherent non-uniform memory access), each block of memory is associated with a fixed home node, which maintains a directory structure to track the state of all locally-homed data. When a processor is about to modify a synchronization variable that has been modified by another processor, the local DSM hardware sends a message to the barrier’s home node to request a copy and acquire exclusive ownership. Depending on the dirty copy owner’s location, the home node may need to send messages to additional nodes to service the request. During a barrier operation, the ownership of the synchronization variable is passed from one processor to another across the network until all participating processors have atomically incremented the barrier variable.
In an attempt to speed up synchronization operations, we propose adding an Active Memory Unit (AMU) capable of performing simple atomic operations to the memory controller (MC). The presence of AMUs will let processors operate on remote data at the data’s home node, rather than loading it into a local cache, operating on it, and potentially flushing it back to the home node. We call AMU-supported operations Active Memory Operations (AMOs), because they make the conventional “passive” memory controller more “active”. The goal of AMOs is to reduce network traffic, avoid the high latency of remote memory accesses, and ultimately increase the performance of parallel applications.
An existing means of avoiding remote communication is Active Messages [27] (ActMsg). An active
message includes the address of a user-level handler to be executed by the home node processor upon
message arrival using the message body as argument. While active messages are very effective in reducing
communication traffic, using the node’s primary processor to execute the handlers has higher latency than
dedicated hardware and interferes with useful work. With an AMU, similar functionality can be performed in
the MC, which avoids these problems. Although active messages were initially developed for the message-
passing environment to better overlap communication and computation, we find that they are also effective
in improving the performance of barriers. Thus, we will compare AMO-based barriers to active-message-
based barriers in our study.
In general, a barrier synchronization of N processes requires N serial atomic operations. This serialization
exposes the long round trip latency of the network. To enable parallelism during synchronization, some
researchers have proposed barrier trees [5, 21, 28] , which use multiple barrier variables organized as a
tree for one barrier operation. For example, in Yew, Tzeng and Lawrie’s software combining tree [28],
the processors are leaves of the tree and are organized into groups. The last processor to arrive in a group
increments a counter in the group’s parent node. Continuing in this fashion, the last processor reaching the
barrier point works its way to the root of the tree and triggers a reverse wave of wakeup operations to all
processors. Barrier trees achieve significant performance gains on large-scale systems, but they introduce
extra programming complexity. We will later show that AMOs can achieve better performance without
incurring this programming complexity.
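The software combining tree described above can be sketched as follows. The fixed two-level shape, eight-group limit, and single generation counter are our simplifications for illustration, not the implementation from [28]:

```c
#include <stdatomic.h>

/* Sketch of a two-level combining-tree barrier in the spirit of a software
 * combining tree: processors are grouped; the last arriver in each group
 * increments the parent (root) counter, and the last group at the root
 * triggers the wakeup wave. */
struct tree_barrier {
    _Atomic int group_count[8];   /* per-group arrival counters (leaves) */
    _Atomic int root_count;       /* arrival counter at the root */
    _Atomic int generation;       /* bumped by the last arriver to wake everyone */
    int group_size, num_groups;
};

void tree_barrier_wait(struct tree_barrier *b, int my_group) {
    int gen = atomic_load(&b->generation);
    /* the last arriver in a group propagates the arrival to the root */
    if (atomic_fetch_add(&b->group_count[my_group], 1) + 1 == b->group_size) {
        atomic_store(&b->group_count[my_group], 0);       /* reset for reuse */
        /* the last group to arrive at the root releases all processors */
        if (atomic_fetch_add(&b->root_count, 1) + 1 == b->num_groups) {
            atomic_store(&b->root_count, 0);
            atomic_fetch_add(&b->generation, 1);          /* release spinners */
        }
    }
    while (atomic_load(&b->generation) == gen)
        ;  /* spin until the wakeup wave arrives */
}
```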
In the rest of the paper, we present the architecture and software interface for the proposed design and
compare AMOs against LL/SC instructions, processor-side atomic instructions, active messages, and tree-
based implementations of these existing mechanisms. We test the different implementations of the barrier
function of an OpenMP library on an execution-driven simulator. The simulator results show that AMOs
outperform LL/SC instructions, atomic instructions, and active messages by up to a factor of 80, with
geometric means of 20.8, 15.5, and 13.4, respectively, and their tree-based versions by factors of 7.8, 7.1
and 6.9, respectively.
The remainder of this paper is organized as follows. Section 2 describes the active memory architectures
and programming models. Section 3 describes the simulation environment and gives the simulator results.
Section 4 surveys related work. Section 5 summarizes our conclusions and discusses future work.
2 AMU-Supported Synchronization
Our proposed mechanism adds an AMU that can perform simple atomic arithmetic operations to the MC,
extends the coherence protocol to support fine-grained updates, and augments the ISA with a few special
AMO instructions that initiate active memory operations. The initial design of the AMU supports three
AMO instructions: amo.inc (increment by one), amo.fetchadd (fetch and add), and amo.compswap
(compare and swap). Semantically, these instructions are just like the atomic instructions implemented
by some microprocessors. Programmers can use them as if they were traditional processor-side atomic
operations.
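The semantics of the three AMOs can be illustrated with plain C functions (the function names are ours; in hardware, each operation executes atomically inside the AMU at the variable's home memory controller):

```c
/* Illustrative semantics of the three initial AMO instructions. Each runs
 * atomically at the home node's AMU; these plain functions only show the
 * value transformation, not the coherence machinery. */
unsigned amo_fetchadd(unsigned *addr, unsigned delta) {
    unsigned old = *addr;       /* the old value is returned to the requester */
    *addr = old + delta;
    return old;
}

unsigned amo_inc_simple(unsigned *addr) {   /* amo.inc, ignoring the "test"
                                               value described later */
    return amo_fetchadd(addr, 1);
}

unsigned amo_compswap(unsigned *addr, unsigned expect, unsigned newval) {
    unsigned old = *addr;
    if (old == expect)          /* write only when the current value matches */
        *addr = newval;
    return old;
}
```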
In the rest of this section, Section 2.1 presents the hardware organization of the AMU. Section 2.2 de-
scribes the fine-grained update protocol. Section 2.3 describes the programming model of AMOs.
Figure 1 depicts the architecture that we assume. A crossbar connects processors to the network back-
plane, from which they can access remote processors, as well as their local memory and IO subsystems,
shown in Figure 1 (a). In our model, the processors, crossbar, and memory controller all reside on the same
die, as will be typical in near-future system designs. Figure 1 (b) is the block diagram of the Active Memory
Controller with the proposed AMU delimited within the dotted box.
When a processor issues an AMO instruction, it examines the target address and sends a request to the
AMU of the target address’ home node. When the message arrives at the AMU of that node, it is placed in
a queue waiting for dispatch. The control logic of the AMU exports a READY signal to the queue when it
is ready to accept another request. The operands are then read and fed to the memory function unit (MFU
in Figure 1 (b)). The old or new value may be returned to the requesting processor, depending on the AMO.
Memory references for synchronization variables exhibit high temporal locality because every participat-
ing process accesses the same synchronization variable. To further improve the performance of AMOs, we
add a very small cache to the AMU. This cache effectively coalesces operations to synchronization variables,
[Figure 1 (a) shows a node: two CPU cores with caches connected by system busses to an on-die crossbar, the (active) memory controller, and a DRAM module. Figure 1 (b) shows the active memory controller: the DRAM controller and directory controller on the normal data path to/from the crossbar, with the AMU (issue queue, MFU, and AMU cache) alongside.]

Figure 1. Architecture of Active Memory Unit
eliminating the need to load from and write to the off-chip DRAM module every time. Each AMO that hits
in the AMU cache takes only two cycles to complete, independent of the number of nodes contending for
the synchronization variable. The size of the AMU cache is an open design issue. For this study, we assume
a four-entry AMU cache.
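The lookup in such a small cache can be sketched as below; the fully-associative organization and field layout are our assumptions for illustration, not a specified design:

```c
#include <stdbool.h>

/* Sketch of the small fully-associative AMU cache assumed in this study:
 * four entries, each holding one synchronization word. */
struct amu_cache {
    unsigned long tag[4];     /* word address of the cached variable */
    unsigned      value[4];   /* current value of the variable */
    bool          valid[4];
};

/* Returns the entry index on a hit (the two-cycle fast path), or -1 on a
 * miss (the AMU must then fetch the word from DRAM or a processor cache). */
int amu_lookup(const struct amu_cache *c, unsigned long addr) {
    for (int i = 0; i < 4; i++)
        if (c->valid[i] && c->tag[i] == addr)
            return i;
    return -1;
}
```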
AMO instructions operate on coherent data. AMU-generated requests are sent to the directory controller
as fine-grained “get” (for reads) or “put” (for writes) requests. A fine-grained “get” loads the coherent
value of a word or a double-word. In our model, the directory controller maintains coherence at the block
level. When it receives a get request from the AMU, it loads the coherent value of the target word either
from a processor cache or local memory, depending on the state of the block containing the word. It then
changes the state of the block to “shared” and adds the AMU to the list of sharers. Unlike traditional data
sharers, the AMU is allowed to modify the word without obtaining exclusive ownership first. The AMU
sends fine-grained “put” requests to the directory controller when it needs to write barrier variables back to
local memory. (Fine-grained “get/put” operations are planned for a future SGI DSM architecture; their
details are beyond the scope of this paper.) A fine-grained “put” updates a word of a block. When the
directory controller receives a
put request, it will send a word-update request to local memory and every node that has a copy of the block
containing the word to be updated.
To take advantage of get/put, the amo.inc instruction includes a “test” value that is compared against
the result of the increment. If the incremented value matches the test value, e.g., as would be the case in a
barrier operation if the test value is the total number of processors expected to reach the barrier, the AMU
sends a put request along with the new value to the directory controller. This update is like a signal to all
waiting processors. It does not have any extra overhead and happens at the earliest time possible.
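The home-node handling of amo.inc with its “test” value can be sketched as follows. The hook type and all names are ours, standing in for the fine-grained “put” machinery:

```c
/* Sketch of how the home-node AMU might process amo.inc: increment the word
 * and, when the result matches the "test" value, invoke a hook standing in
 * for the fine-grained "put" that updates every sharer's cached copy. */
typedef void (*put_hook)(unsigned *addr, unsigned value);

static unsigned last_update;                 /* recorded by the example hook */
static void record_put(unsigned *addr, unsigned value) {
    (void)addr;
    last_update = value;                     /* stand-in for updating sharers */
}

unsigned amu_handle_inc(unsigned *addr, unsigned test, put_hook send_put) {
    unsigned old = *addr;
    *addr = old + 1;
    if (*addr == test)                       /* last process reached the barrier */
        send_put(addr, *addr);               /* word-grained update wakes spinners */
    return old;                              /* old value goes back to requester */
}
```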
A potential way to optimize conventional barriers is to use a write-update protocol to eliminate invalida-
tion requests for barrier variables. However, issuing a block update after each increment generates too much
network traffic, offsetting the benefit of eliminating invalidation requests. The get/put mechanism eliminates
write-update’s drawbacks, because it issues word-grained updates (thereby eliminating false sharing) and it
only issues updates after the last process reaches the barrier rather than once each time any processor reaches
the barrier.
One possible problem with using get/put is it introduces temporal inconsistency between the barrier vari-
able values in the processor caches and the AMU cache. In essence, the get/put mechanism implements
release consistency for barrier variables, where the act of reaching the “test” value acts as a release point.
For barrier operations, release consistency is a completely acceptable memory model, but we must be extra
careful when applying AMOs to applications where release consistency might cause problems.
The first time a synchronization variable is accessed, the AMU obtains the latest copy from either a
processor cache or local DRAM. All subsequent AMOs to the same synchronization variable will find the
data in the AMU cache and thus require only two cycles to process. The synchronization variable will stay
in the AMU cache until it is evicted or the directory controller receives a request for exclusive ownership
for the block containing the variable.
AMO instructions are encoded in an unused portion of the MIPS-IV instruction set space, because our
target implementation of AMOs is an as-yet-unannounced future member of the MIPS processor family.
We are considering a wide range of AMO instructions, but for this study, we focus only on amo.inc. The
amo.inc instruction increments the target memory location by one and returns the original memory content
to a general register. It can be used to implement efficient barriers. Its mnemonic form is amo.inc rd,
rs1, rs2, where rd is the destination register to store the return value, rs1 holds a memory address,
and rs2 holds the “test” value. The test value is set as the expected value of the barrier variable after all
processes have reached the barrier point.
Before we describe how to use amo.inc to implement barriers, let us first investigate how barriers
are implemented. The following is a naive barrier implementation in which num procs is the number of
participating processes:
    atomic_add(&barrier_variable);
    spin_until(barrier_variable == num_procs);
This implementation is inefficient because it directly spins on the barrier variable. Since processes that
have reached the barrier repeatedly try to read a copy of the barrier variable into their local caches to test if
it has reached the expected value (num procs in this case), the next increment attempt by another process
will have to compete with these read requests, possibly resulting in a very long latency for the increment
operation. Although processes that have reached the barrier can be suspended to avoid interference with the
subsequent increment operations, the overhead of suspending and resuming processes is too high for an
efficient barrier.
A common optimization to the naive version is to spin on another variable, such as the following.
    int expected_value = barrier_counter + num_procs;
    if (atomic_add_one(&barrier_variable) + 1 == expected_value)
        barrier_counter = expected_value;
    else
        spin_until(barrier_counter == expected_value);
The code segment above assumes that atomic_add_one() returns the value of the barrier variable
before the increment. Instead of spinning on the barrier variable, this loop spins on another variable,
barrier_counter. For simplicity, we call barrier_counter the “spin” variable. The code above
differs slightly from the naive version because it has been extended so that it can be used for more than
one barrier operation after initialization, as long as barrier_counter and barrier_variable do
not overflow. Because data coherence is maintained at block level, for this code to work properly, pro-
grammers must make sure that the barrier variable and the spin variable do not reside in the same block.
Using the spin variable eliminates the interference between spin and increment operations. However, it
introduces some extra overhead — a write to the spin variable for each synchronization. A write to the
spin variable involves the home node sending an invalidation request to every processor before the write
occurs and each processor reloading a copy of the spin variable after the write has completed. Introduction
of the spin variable is undesirable because it is not an element in the algorithm but simply a programming
technique of trading programming complexity for performance. Nevertheless, the benefit of using the spin
variable often outweighs its overhead. Nikolopoulos and Papatheodorou [18] have demonstrated that
spinning on the spin variable is 25% faster than spinning on the barrier variable for a synchronization across
64 processors.
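The requirement that the barrier variable and the spin variable sit in separate coherence blocks can be met with alignment or padding; a sketch, assuming 128-byte blocks (the L2 line size used in the system we simulate later):

```c
#include <stddef.h>

/* One way to keep the barrier variable and the spin variable in different
 * coherence blocks: align each to its own 128-byte block. The block size is
 * an assumption of this sketch. */
struct barrier_state {
    _Alignas(128) volatile unsigned barrier_variable;  /* incremented atomically */
    _Alignas(128) volatile unsigned barrier_counter;   /* the "spin" variable */
};
```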
With AMOs, atomic increment operations are performed without invalidating shared copies in the pro-
cessor caches and the shared copies in the processor caches are automatically updated when all processes
reach the barrier. Consequently, AMO-based barriers can use the naive coding, i.e., directly spinning on the
barrier variable. The equivalent AMO-based version of barrier looks like the following.
    int expected_value = (barrier_variable / num_procs + 1) * num_procs;
    amo_inc(&barrier_variable, expected_value);
    spin_until(barrier_variable == expected_value);
In this version, amo_inc() is a wrapper function for the amo.inc instruction. Before calling it, the
program must first figure out the expected barrier value after all processes reach the barrier. For simplicity,
we require the initial value of the barrier variable to be a multiple of num_procs, such as zero. With
that guarantee, the expected value can be computed as ((barrier_variable / num_procs + 1) *
num_procs).
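As a worked example of this computation, with num_procs = 4 and a barrier variable of 8 (two completed barriers), the next expected value is (8 / 4 + 1) * 4 = 12:

```c
/* The expected-value computation from the AMO-based barrier, as a helper. */
int next_expected_value(int barrier_variable, int num_procs) {
    return (barrier_variable / num_procs + 1) * num_procs;
}
```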
Conventional processor-based atomic instructions often require significant effort from programmers to
write correct and efficient synchronization codes. For example, the Alpha Architecture Handbook [1] uses
three pages to explain each of the LL/SC instructions along with many restrictions. Processor-side imple-
mentations of atomic instructions, like fetchadd and cmpxchg on Itanium2TM [7], need to lock many system
resources to ensure atomicity. This leads to subtle situations that can cause livelocks, e.g., when stores to the
Write Coalescer are closely followed by semaphore instructions. By contrast, AMO instructions are inter-
ference free, do not lock any system resources to ensure atomicity, and eliminate the need for programmers
to be aware of how the atomic instructions are implemented. Since synchronization-related code is often the
hardest part of a parallel application to write and debug, simplifying the programming model is a significant
advantage of AMOs over other mechanisms, in addition to their performance superiority.
3 Evaluation
This section presents our experimental environment and results. We describe the simulation environment
in Section 3.1, delineate the different barrier synchronization functions that we have tried in Section 3.2, and
present the simulation results in Section 3.3.
Parameter            Value
Processor            four-way issue, 48-entry active list, 2 GHz
L1 I-cache           two-way, 32KB, 64-byte lines, 1-cycle latency
L1 D-cache           two-way, 32KB, 32-byte lines, 2-cycle latency
L2 cache             four-way, 2MB, 128-byte lines, 10-cycle latency
System bus           16-byte CPU to system, 8-byte system to CPU,
                     maximum 16 outstanding cache misses, 1 GHz
DRAM backend         16 16-bit-data DDR channels
Hub clock rate       500 MHz
DRAM latency         60 processor cycles
Network hop latency  100 processor cycles

Table 1. Configuration of the simulated systems.
We use the execution-driven simulator UVSIM in our performance study. UVSIM models a future-
generation SGI DSM supercomputer, including a directory-based SN2-MIPS-like coherence protocol [24]
that supports both write-invalidate and write-update. The write-update protocol is fine-grained, meaning
that an update request may replace as little as a single word in a cache line, as described in Section 2.2. Each
simulated node contains two MIPS R18000-like [17] microprocessors connected to sysTF busses [25]. Also
connected to the bus is SGI’s future-generation Hub [25], which contains the processor interface, memory
controller, directory controller, network interface, IO interface, and active memory unit.
UVSIM has a micro-kernel that supports most common system calls. It directly executes statically linked
64-bit MIPS-IV executables. UVSIM supports the OpenMP runtime environment. All benchmark programs
used in this paper are OpenMP-based parallel programs.
Table 1 lists the major parameters of the simulated systems. The L1 cache is virtually indexed and
physically tagged. The L2 cache is physically indexed and physically tagged. The DRAM backend has 16
20-bit channels connected to DDR DRAMs, which enables us to read an 80-bit burst every two cycles. Of
each 80-bit burst, 64 bits are data. The remaining 16 bits are a mix of ECC bits and partial directory state.
The simulated interconnect subsystem is based on SGI’s NUMALink-4. The interconnect is built using
16-port routers in a fat-tree structure, where each non-leaf router has eight children. We model a network
hop latency of 50 nsecs (100 cpu cycles). The minimum network packet is 32 bytes.
We have validated the core of our simulator by setting the configurable parameters to match those of an
SGI Origin 3000, running a large mix of benchmark programs on both a real Origin 3000 and the simulator,
Name      Description
Baseline  uses a pair of LL/SC instructions
ActMsg    uses active messages
Atomic    uses an atomic increment instruction
AMO       uses the amo.inc instruction
Table 2. Implementations of the barrier function.
and comparing performance statistics (e.g., run time, cache miss rates, etc.). The simulator-generated results
are all within 20% of the corresponding numbers generated by the real machine, most within 5%.
For the initial study of AMOs, we use amo.inc to optimize the barrier function from an OpenMP
library and compare AMOs with LL/SC instructions, processor-side atomic instructions, and active mes-
sages. Table 2 lists the different versions of the barrier function. We use the slowest and most widely
used implementation, the LL/SC version, as the baseline. In addition, we apply the software combining
tree based on the work by Yew et al. [28] to each implementation.
We have extended UVSIM to support active messages. In our model, both AMOs and active messages
share the same programming model; the only difference is the way each is handled by the receiving node.
We assume user-level interrupts are supported for active messages on both the recipient processor and the
initiating processor, so our results for active messages are somewhat optimistic.
All programs in our study are compiled using the MIPSpro Compiler 7.3 at optimization level “-O3”.
All results in this section come from complete simulations of the benchmark programs. Each program
invokes the synchronization function 11 times. The first call is used to synchronize the starting point of each
process. We then divide the wall time for the following ten invocations by ten and report this as the elapsed
time for a barrier operation.
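This methodology can be sketched as below; the function pointers and the stub clock stand in for the real OpenMP barrier and wall-clock routines:

```c
/* Measurement methodology: one warm-up barrier to synchronize the starting
 * points, then ten timed barriers, with the average reported. */
double time_barrier(void (*barrier)(void), double (*wall_time)(void)) {
    barrier();                              /* first call: align all processes */
    double start = wall_time();
    for (int i = 0; i < 10; i++)
        barrier();                          /* ten measured invocations */
    return (wall_time() - start) / 10.0;    /* elapsed time per barrier */
}

/* Stub barrier and clock for illustration: the clock advances 5.0 per query. */
static void   noop_barrier(void) {}
static double stub_now;
static double stub_clock(void) { return stub_now += 5.0; }
```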
Non-tree-based barriers
Table 3 presents the speedups of different barrier implementations over the baseline implementation. The
first column of this table contains the numbers of processors being synchronized. We vary the number of
processors from four (i.e., two nodes) to 256, the maximum number of processors allowed by the directory
       Speedup over Baseline
CPUs   Atomic  ActMsg  AMO
4      1.25    1.16    3.06
8      1.27    1.57    9.26
16     1.25    1.36    14.04
32     1.27    1.53    23.71
64     1.37    1.69    42.59
128    1.37    1.81    50.59
256    1.41    1.86    82.30
Table 3. Performance of different barriers.
[Figure 2 plots cycles per processor against the number of processors (4 to 256) for the Baseline, Atomic, ActMsg, and AMO barrier implementations.]

Figure 2. Cycles-per-processor of different barriers.
structure of the SN2-MIPS protocol [24]. The Atomic, ActMsg, and AMO barrier implementations all
achieve significant speedups over the baseline LL/SC version. Specifically, conventional atomic instructions
outperform LL/SC by a factor of 1.25 to 1.41. Active messages outperform LL/SC by a factor of 1.16 to 1.86.
However, AMO-based barrier performance dwarfs that of all other implementations, ranging from a speedup
of 3.06 for four processors to as high as 82.30 for 256 processors.
In the baseline implementation, each processor loads the barrier variable into its local cache before in-
crementing it using LL/SC instructions. Only one processor will succeed at one time; the LL/SCs on the
other processors will fail. After a successful update by a processor, the barrier variable will move to another
processor, and then to another processor, and so on. As the system grows, the average latency to move the
barrier variable between processors increases, as does the amount of contention. As a result, the synchro-
nization time in the base version increases superlinearly as the number of nodes increases. This effect can
be seen particularly clearly in Figure 2, which plots the per-processor barrier synchronization time for each
barrier implementation.
The conventional atomic instruction implementation of barriers eliminates the failed SC attempts in the
baseline version, and thus improves upon the baseline by more than 25% in all cases. However, its perfor-
mance gains are relatively small compared to the other implementations that we considered, because it still
requires a round trip over the network for every atomic operation, all of which must be performed serially.
For the active message version, an active message is sent for every increment operation. The overhead
of invoking the active message handler for each increment operation dwarfs the time required to run the
handler itself. Nonetheless, the benefit of eliminating remote memory accesses outweighs the high invoca-
tion overhead, which results in the active message version of barriers outperforming the baseline version for
all systems by 16% to 86%. In addition, the high overhead of each invocation results in a significant
number of messages being queued, which leads to message timeouts and retransmissions. These drawbacks
make active messages perform relatively poorly compared to AMOs.
The AMO implementation of barriers uses the special amo.inc instruction. After issuing an AMO, each
process spins on the barrier variable, waiting for it to reach the prescribed value. When the first amo.inc
operation arrives at the barrier variable’s home AMU, local DRAM or a processor cache is accessed to load
the barrier variable into the AMU’s cache. All subsequent AMOs find the data in the AMU cache and thus
require only two cycles to process. Thus, the bottleneck for AMO performance is the time to transfer the
request from the network backplane. After the barrier variable reaches a specified value, an update is sent
to every node that has a copy of the variable. Since a local process is spinning on the data, it is very likely
that every node will have a copy of the barrier variable in its local caches. Thus, the total cost of sending
update requests is close to the time required to send a single update request multiplied by the number of
participating processors.
Roughly speaking, the time to perform an AMO-based barrier equals (C + k * N), where C is a fixed
overhead, k is a small value related to the processing time of an amo.inc operation and an update
request, and N is the number of processors being synchronized. This expression implies that AMO-based
barriers scale well, which is clearly illustrated by the performance numbers in Figure 2. This figure shows
that the per-processor latency of AMO-based barriers is constant with respect to the total number of pro-
cessors. In fact, the per-processor latency drops off slightly as the number of processors increases for
AMO-based barriers because the fixed overhead is amortized by more processors. Contrast this with the
other implementations, where the larger the number of nodes in the system, the higher the per-processor
synchronization time.

(We do not assume that the network can physically multicast updates; AMO performance would be even
higher if the network supported such operations.)
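The scaling argument can be restated as a toy cost model: if a barrier costs roughly a fixed overhead C plus a small per-processor cost k for each of N processors, the per-processor cost C/N + k falls toward the constant k as N grows. The constants below are illustrative, not measured values:

```c
/* Per-processor cost of an AMO-based barrier under the model
 * total time = C + k*N, so per-processor time = C/N + k. */
double per_processor_cost(double C, double k, int N) {
    return (C + k * N) / N;
}
```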
       Speedup over Baseline
CPUs   Baseline+tree  Atomic+tree  ActMsg+tree  AMO+tree  AMO
8      1.04           1.11         1.19         1.32      9.26
16     2.06           2.17         2.08         2.10      14.04
32     2.42           2.97         3.57         3.82      23.71
64     4.80           5.46         6.74         9.06      42.59
128    6.28           6.17         8.16         11.35     50.59
256    10.03          12.08        13.51        14.90     82.30
Table 4. Performance of tree-based barriers.
[Figure 3 plots cycles per processor against the number of processors (8 to 256) for the Baseline+tree, Atomic+tree, ActMsg+tree, AMO+tree, and AMO barrier implementations.]

Figure 3. Cycles-per-processor of tree-based barriers.
Tree-based barriers
For all tree-based barriers, we use a two-level tree structure regardless of the number of processors. For each
configuration, we try all possible tree branching factors and use the one that delivers the best performance.
The initialization time of the tree structures is not included in the reported results. Since tree barrier algo-
rithms are designed for non-trivial system sizes, the smallest configuration that we consider in this study is
8 processors. Table 4 shows the speedups of tree-based barriers over the original baseline implementation.
Figure 3 shows the number of cycles per processor for the tree-based barriers.
Our simulation results clearly indicate that tree-based barriers perform much better and scale much better
than normal barriers, which concurs with the findings of Michael et al. [15]. Tree-based barriers create
parallelism and reduce hot-spot contention, so they are very effective at speeding up normal barriers.
On a 256-processor system, all tree-based barriers are at least ten times faster than the baseline barrier.
As seen in Figure 3, the number of cycles per processor for tree-based barriers decreases as the number
of processors increases, because the high overhead associated with using trees is amortized across more
processors and increments to different parts of the tree can proceed in parallel.
A tree-based barrier on a large system is basically a combination of several barriers on smaller systems.
Traditional barriers have similar performance on small-scale systems, so the differences among traditional
tree-based barriers are rather small.
One reason that tree-based barriers perform uniformly well is that we have tried different tree branch-
ing factors and present the best result for each given system. In a real system, we might not have the luxury
of choosing the optimal branching factors and tree re-construction could add non-negligible overhead. The
best branching factor for a given system is often not intuitive. Markatos et al. [12] have demonstrated that
in some multiprogramming environments, improper use of trees can drastically degrade the performance of
tree-based barriers to even below that of simple centralized barriers. Nonetheless, our simulation results do
demonstrate the performance potential of combining-tree-based barriers.
Even with all of the advantages of tree structures, tree-based barriers are still significantly slower than
AMO-based barriers. For instance, the best tree-based barrier (ActMsg + tree) is more than six times slower
than the AMO-based barrier on a 256-processor system. This is because tree-based barriers involve
higher overhead and generate more network traffic than AMO-based barriers.
Interestingly, the combination of AMOs and trees performs worse than AMOs alone for configurations of
this size. The cost of an AMO-based barrier includes a large fixed overhead and a very small number of cycles
per processor. Using tree structures with AMO-based barriers essentially introduces the fixed overhead more
than once, therefore resulting in a longer barrier synchronization time. That AMOs alone are better than the
combination of AMOs and trees for this configuration is another indication that AMOs do not require heroic
programming effort to achieve good performance. However, this relationship between normal AMOs and
tree-based AMOs might change if we move to systems with tens of thousands of processors. Determining
whether or not tree-based AMO barriers can provide extra benefits on such extremely large systems is part
of our future work.
4 Related Work
Physical-wire-based barriers have had great success in small- to medium-scale systems. Lundstrom [11]
proposes an ANDing network for barrier synchronization. Shang et al. [23] propose a distributed wired-
NOR network. Cray T3D [2] implements barriers with hierarchically constructed AND trees and fanout
circuits. Cray T3E [22] implements barrier trees using dedicated barrier/Eureka synchronization units, but
without using dedicated physical wires.
The fetch-and-add instructions in the NYU Ultracomputer [4, 9] are similar to AMOs in that both are
implemented in the memory controller. The Ultracomputer’s fetch-and-add uses a combining network that
tries to combine all loads and stores for the same memory location within the routers. It has the potential
drawback of delaying requests that do not use this feature. Combining is effective only when many fetch-
and-add instructions are issued over a very short time interval. In contrast, our approach performs the
combining at the barrier variable’s home memory controller, which eliminates the closeness-in-time problem
associated with combining in the network, and requires only two DRAM accesses per barrier regardless of the
number of processes.
The SGI Origin 2000 [10] and Cray T3E’s [22] atomic memory operations come closest to our work.
Like AMOs, atomic memory operations are implemented at the memory controller. Processors trigger
atomic memory operations by sending requests to special IO addresses on the home memory controller
of the atomic variable. The home MC interprets those special requests as commands to perform simple
atomic operations. With atomic memory operations, programs can either use uncached loads or, for better
performance, use spin variables during spinning. AMOs do not require spin variables, and thus are superior
to atomic memory operations in both performance and programming convenience. Comparing AMOs with
atomic memory operations is part of future work.
Michael et al. [16] have studied several implementations of the LL/SC instructions. In their studies,
computation power is placed at different levels of the memory hierarchy. They do so by carefully choosing
coherence policies for synchronization operations. For the computation-in-memory environment, a cen-
tralized reservation mechanism is used for granting the exclusive ownership to one of the processors that
attempt to do SC on a synchronization variable. This mechanism is similar to the LLbit approach in SGI
MIPS processors [17]. It needs to maintain a bit vector or a counter for each memory location in which the
synchronization variable may reside. Its performance gain is likely smaller than AMOs' for three reasons:
(i) every LL or SC must go to memory for the bit vector or counter; (ii) SC attempts fail frequently due to
heavy contention; and, (iii) updates that are sent to the sharers after each successful write generate too much
traffic and jam the interconnection network.
Adding intelligence to the memory controller is an approach taken by several researchers working to
overcome the “Memory Wall” problem [3, 14, 29]. Several researchers propose to incorporate processing
power on DRAM chips, so-called PIM (processor-in-memory) systems [6, 19, 20, 26]. Unlike PIMs, we do
not propose placing processors on the DRAM chips themselves. We believe that non-commodity DRAMs,
with their expected lower yields, lower density, and higher per-byte costs, are not well-suited for cost-
effective large-scale machines. Further, the performance of the PIM architectures is somewhat limited by
the fact that DRAM processes are slower than logic processes, and it is difficult to enforce data coherence
in PIM architectures.
5 Conclusions and Future Work
Efficient synchronization is a crucial operation for effective parallel programming of large multiprocessor systems. As network latency rapidly approaches thousands of processor cycles and multiprocessor systems become larger and larger, synchronization speed is quickly becoming a significant performance determinant.
In this paper, we propose a mechanism that uses special atomic active memory operations performed by the home node memory controller of a synchronization variable to achieve super-fast barrier synchronization. AMO-based barriers do not require extra spin variables or complicated tree structures to achieve good
performance. AMOs enable extremely efficient barrier synchronization at rather low hardware cost, with a
simple programming interface. Our initial simulation results indicate that AMOs are much more efficient
than the traditional LL/SC instructions, atomic instructions, and active messages, outperforming them by up
to a factor of 80.
AMU development is still in its infancy. Part of our future work is to compare AMOs to the other forms
of off-loaded atomic memory operations described in Section 4. We also need to test the sensitivity of
AMOs to network latency and investigate how to use AMOs to improve other synchronization primitives,
e.g., spinlocks and wait/signal. We are also considering how and whether to expand the AMU to support
more complicated operations like floating-point operations, vector-like operations that operate on a range of
data elements, and collective communication operations. For data with insufficient reuse to warrant moving
it across the network, using AMOs may significantly reduce network traffic, hide the long network latency,
and thus improve the overall performance of a variety of parallel operations.
References
[1] Compaq Computer Corporation. Alpha architecture handbook, version 4, Feb. 1998.
[2] Cray Research, Inc. Cray T3D systems architecture overview, 1993.
[3] W. Donovan, P. Sabella, I. Kabir, and M. Hsieh. Pixel processing in a memory controller. IEEE Graphics and Applications, 15(1):51–61, Jan. 1995.
[4] A. Gottlieb, R. Grishman, C. Kruskal, K. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer -- designing a MIMD shared-memory parallel machine. ACM TOPLAS, 5(2):164–189, Apr. 1983.
[5] R. Gupta and C. Hill. A scalable implementation of barrier synchronization using an adaptive combining tree. International Journal of Parallel Programming, 18(3):161–180, June 1989.
[6] M. Hall, P. Kogge, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, A. Srivastava, W. Athas, and V. Freeh. Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In SC'99, Nov. 1999.
[7] Intel Corp. Intel Itanium 2 processor reference manual. http://www.intel.com/design/itanium2/manuals/25111001.pdf.
[8] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice-Hall, 1992.
[9] C. Kruskal, L. Rudolph, and M. Snir. Efficient synchronization on multiprocessors with shared memory. ACM TOPLAS, 10(4):570–601, Oct. 1988.
[10] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA highly scalable server. In ISCA '97, pp. 241–251, June 1997.
[11] S. Lundstrom. Application considerations in the system design of highly concurrent multiprocessors. IEEE Trans. on Computers, C-36(11):1292–1390, Nov. 1987.
[12] E. Markatos, M. Crovella, P. Das, C. Dubnicki, and T. LeBlanc. The effect of multiprogramming on barrier synchronization. In Proc. of the 3rd IEEE Symposium on Parallel and Distributed Processing, pp. 662–669, Dec. 1991.
[13] C. May, E. Silha, R. Simpson, and H. Warren. The PowerPC Architecture: A Specification for a New Family of Processors, 2nd edition. Morgan Kaufmann, May 1994.
[14] S. McKee et al. Design and evaluation of dynamic access ordering hardware. In Proc. of the 1996 ICS, pp. 125–132, May 1996.
[15] M. M. Michael, A. K. Nanda, B. Lim, and M. L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. In Proc. of the 24th ISCA, pp. 219–228, June 1997.
[16] M. M. Michael and M. L. Scott. Implementation of atomic primitives on distributed shared memory multiprocessors. In Proc. of the First HPCA, pp. 222–231, Jan. 1995.
[17] MIPS Technologies, Inc. MIPS R18000 Microprocessor User's Manual, Version 2.0, Oct. 2002.
[18] D. S. Nikolopoulos and T. A. Papatheodorou. The architecture and operating system implications on the performance of synchronization on ccNUMA multiprocessors. International Journal of Parallel Programming, 29(3):249–282, June 2001.
[19] M. Oskin, F. Chong, and T. Sherwood. Active pages: A model of computation for intelligent memory. In Proc. of the 25th ISCA, pp. 192–203, June 1998.
[20] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A case for Intelligent RAM: IRAM. IEEE Micro, 17(2), Apr. 1997.
[21] M. Scott and J. Mellor-Crummey. Fast, contention-free combining tree barriers for shared memory multiprocessors. International Journal of Parallel Programming, 22(4), 1994.
[22] S. Scott. Synchronization and communication in the T3E multiprocessor. In Proc. of the 7th ASPLOS, Oct. 1996.
[23] S. Shang and K. Hwang. Distributed hardwired barrier synchronization for scalable multiprocessor clusters. IEEE Trans. on Parallel and Distributed Systems, 6(6):591–605, June 1995.
[24] Silicon Graphics, Inc. SN2-MIPS Communication Protocol Specification, Revision 0.12, Nov. 2001.
[25] Silicon Graphics, Inc. Orbit Functional Specification, Vol. 1, Revision 0.1, Apr. 2002.
[26] T. Sunaga, P. M. Kogge, et al. A processor in memory chip for massively parallel embedded applications. IEEE Journal of Solid-State Circuits, pp. 1556–1559, Oct. 1996.
[27] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser. Active messages: A mechanism for integrated communication and computation. In Proc. of the 19th ISCA, pp. 256–266, May 1992.
[28] P. Yew, N. Tzeng, and D. Lawrie. Distributing hot-spot addressing in large-scale multiprocessors. IEEE Trans. on Computers, C-36(4):388–395, Apr. 1987.
[29] L. Zhang, Z. Fang, M. Parker, B. Mathew, L. Schaelicke, J. Carter, W. Hsieh, and S. McKee. The Impulse memory controller. IEEE Trans. on Computers, 50(11):1117–1132, Nov. 2001.
Dynamic Code Repositioning for Java
Shinji Tanaka, Tetsuyasu Yamada, and Satoshi Shiraishi
Network Service Systems Laboratories, NTT
9-11 Midori-Cho, 3-Chome, Musashino-Shi, Tokyo, Japan
email: tanaka.shinji, yamada.tetsuyasu, [email protected]
Abstract
The sizes of recent Java-based server-side applications, like J2EE containers, have been increasing continuously. Past techniques for improving the performance of Java applications have targeted relatively small applications. Moreover, the frequency of invoking the methods of a target application is not normally distributed. As a result, these techniques cannot be applied efficiently to improve the performance of current large applications. We thus propose a dynamic code repositioning approach to improve the hit rates of instruction caches and TLBs. Profiles of method invocations are collected when the application runs under its heaviest processor load, and the code is repositioned based on these profiles. We also discuss a method splitting technique to significantly reduce the sizes of methods. Our evaluation of a prototype implementing these techniques indicated a 5% improvement in the throughput of the application.
Keywords: Java, code repositioning, dynamic compiler, cache and TLB performance, large-scale application.
1 Introduction
Recently, the use of Java has been spreading widely, especially among server-side systems providing web services. The code size of applications running on Java virtual machines (JavaVMs) on these kinds of servers is typically large. Current JavaVMs utilize dynamic compilers like HotSpot [8] and JIT compilers [11] [6]. These compilers convert bytecodes into native instructions and perform various optimizations, including inline method expansion. They do not, however, consider the total footprint of an application's working set of code. Unfortunately, large footprints reduce the hit rates of instruction caches and translation look-aside buffers (TLBs).
Current processors have very long pipelines to improve their performance, so a cache miss causes critical latency. These processors include mechanisms to improve cache hit rates, such as branch prediction and instruction prefetch, but the effectiveness of these mechanisms is limited.
Many researchers have focused on improving the hit rates of instruction caches and TLBs from the application's point of view. Pettis and Hansen [9] pioneered such code repositioning techniques. The goal of code repositioning is to minimize the footprint of an application's working set of code, thus improving the hit rates of the instruction caches and the TLBs. Their work focused on applications written in Fortran and Pascal, while Tamches and Miller [12] focused on kernel functions.
There are also numerous techniques for optimizing Java applications dynamically [1] [2] [10]. We thus applied a code repositioning technique to a JavaVM, with an application server as the target server-side application.
The novel contributions of this paper are the following.
Profiling Algorithm We describe how to gather the profile information used in the code repositioning technique. Because the life cycles of large applications, like those running on application servers, have several stages (i.e., startup, warm-up, and stable processing), the profile information should be gathered in the stage with the heaviest processor load.
Evaluation We evaluated our prototype, built on the IBM JIT, a widely used, high-performing JavaVM. Our evaluation indicated a 5% improvement in the throughput of a target application running on the application server.
The rest of this paper is organized as follows. Section 2 describes the dynamic code repositioning approach. Section 3 discusses the evaluation of our prototype, and Section 4 presents our conclusions.
2 Dynamic Code Repositioning
Current JavaVMs compile methods that are frequently invoked. A typical JavaVM has a counter for each method and increments it when the method is invoked. When a counter exceeds a threshold, the JavaVM compiles the corresponding method. Therefore, the order in which methods are compiled is not related to the order of invocation, and frequently executed code and rarely executed code are intermixed within individual methods. This approach results in low hit rates for instruction caches and TLBs.
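The counter-based compilation trigger just described can be sketched as follows. This is a minimal sketch: the class, the threshold value, and the compile queue are illustrative assumptions, not the JavaVM's actual internals.

```python
# Sketch of counter-triggered JIT compilation (hypothetical names; the
# real JavaVM data structures are not described at this level).
COMPILE_THRESHOLD = 1000  # invocations before a method is compiled

class Method:
    def __init__(self, name):
        self.name = name
        self.counter = 0
        self.compiled = False

def invoke(method, compile_queue):
    """Increment the per-method counter; enqueue the method for
    compilation once the threshold is exceeded. Note that the compile
    order follows the order in which counters cross the threshold, not
    the order of first invocation -- which is why hot and cold code end
    up intermixed in memory."""
    method.counter += 1
    if not method.compiled and method.counter > COMPILE_THRESHOLD:
        method.compiled = True
        compile_queue.append(method.name)
```

A method invoked 1001 times crosses the threshold exactly once and is queued for compilation at that point.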
Our proposal is to increase the hit rates of instruction caches and TLBs by repositioning the compiled code. The principles of code repositioning are as follows.

1. Decrease the distance between frequently called methods and their frequent caller methods.

2. Put rarely executed code in methods farther away.

With code repositioning, the number of instruction cache lines and memory pages occupied by frequently executed code can be reduced. In this section, we discuss how to implement these principles.
2.1 Online Profiling
To determine an optimized positioning for application code, statistical profiles of the frequencies of method invocation are useful, because frequently invoked caller and callee methods should be placed close to each other. This section describes how profiles of the application code are aggregated and how the application code is repositioned.
2.1.1 Life Cycle of an Application
Our target application, an application server, has a life cycle consisting of a start-up period for initializing the system, a warm-up period with few requests, and a stable processing period with more numerous requests. Therefore, the processor load for processing requests and the frequency of dynamically compiling application code vary with the stage of the system's life cycle.
Figure 1: Life cycle of the application server
Figure 1 shows the life cycle of the system. The x-axis represents time, and the y-axis represents the processor load and the frequency of dynamic compiling. The three stages in the life cycle are described in more detail as follows.
Start-up The first stage is start-up, in which the application loads its configuration and components, initializes its subsystems, and activates them. Because most of the application code executed in this stage applies only to initialization, there is little benefit in optimizing it.

Warm-up The second stage is warm-up, in which the application receives only a small number of requests. The throughput at the beginning of this stage is relatively low, because most of the code for processing requests has not yet been compiled. As time passes, the proportion of request-processing code that has been compiled increases, and the throughput rises with it. In contrast, the rate of dynamic compiling decreases with time.

Stable processing The last stage is stable processing, in which the application processes a large number of requests and its throughput is maximal. Most of the application code executed in this stage has already been compiled, and the dynamic compiler of the JavaVM is rarely executed.
To increase the maximum throughput, the optimization targets for dynamic code repositioning are the portions of code executed in the stable processing period, which has a high processor load and a low frequency of dynamic compilation.
2.1.2 Execution Profiling
To acquire a profile during the stable processing stage, a thread called the control thread periodically observes the frequency of compilation. The JIT increments a compilation counter when it compiles code dynamically. If the control thread finds that the counter is less than the threshold, it starts a profiling process. If not, it sleeps for a while.
The profiling process has three stages as follows.
Initialization The control thread inserts code that invokes a profiling method into the prologue code of all the compiled methods. Then the control thread sleeps for the duration of the profiling period, and the compiled code collects profile information on its own.

Profiling The profiling code is invoked with arguments consisting of the program counter of the caller method and the profile information of the callee method. The profiling code maintains a profiling table, each entry of which holds a pair consisting of the program counter of the caller method and the unique ID of the callee method, together with a counter for the number of invocations. When invoked, the profiling code searches the profiling table for the caller method's program counter and the callee method's ID. If they are not found, the profiling code registers the new pair in the profiling table. If they are found, it increments the pair's counter.

Finalization After the profiling period, the control thread awakes and restores the original prologue code, in order to stop profiling.
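The profiling-table update described above can be sketched as a dictionary keyed by (caller program counter, callee ID). The names and values here are illustrative, not the prototype's actual data structures.

```python
# Sketch of the profiling table: each entry pairs a caller program
# counter with a callee method ID and counts invocations. The PC values
# and method IDs below are invented for illustration.
def record_call(table, caller_pc, callee_id):
    """Register the (caller PC, callee ID) pair on first sight,
    otherwise increment its invocation counter."""
    key = (caller_pc, callee_id)
    table[key] = table.get(key, 0) + 1

table = {}
record_call(table, 0x4F10, 7)   # first call from this site
record_call(table, 0x4F10, 7)   # same site again: counter incremented
record_call(table, 0x5A20, 7)   # different call site, same callee
```

After these three calls the table holds two entries, with counts 2 and 1.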
The more samples are gathered, the more accurate the information. If the number of samples gathered during the profiling period is below the threshold, the system does not optimize the code but repeats the profiling process instead.
2.2 Code Repositioning
After the profiling process, the system decides which methods should be placed where, based on a "closest is best" strategy [9]. This section describes how the system determines the positions of the methods and repositions them.
2.2.1 Dynamic Call Graph
After the system gathers a sufficient number of samples, it creates an undirected, weighted dynamic call graph based on the profiling table, which stores the results of the profiling process (Figure 2). Each node of the graph represents a caller or callee method, and each edge represents a call between two methods. The weight of an edge represents the number of times the call was made. Recursive calls are ignored in this graph, because they cannot be optimized by repositioning code.
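Building the undirected weighted graph from the profiling table can be sketched as below. The `pc_to_method` lookup (mapping a caller's program counter back to its method) is an assumption of the sketch; the paper does not spell out how that resolution is done.

```python
# Sketch: aggregate (caller PC, callee ID) invocation counts into an
# undirected weighted call graph between methods, ignoring recursion.
# pc_to_method is an assumed helper mapping program counters to methods.
def build_call_graph(profile_table, pc_to_method):
    graph = {}  # frozenset({caller, callee}) -> summed weight
    for (caller_pc, callee), count in profile_table.items():
        caller = pc_to_method[caller_pc]
        if caller == callee:
            continue  # recursive call: repositioning cannot help
        edge = frozenset((caller, callee))
        graph[edge] = graph.get(edge, 0) + count
    return graph

# Invented profile: two call sites in D calling B, plus one recursive
# call from D to itself, which must be dropped.
profile = {(0x10, "B"): 700, (0x18, "B"): 300, (0x20, "D"): 5}
pcs = {0x10: "D", 0x18: "D", 0x20: "D"}
g = build_call_graph(profile, pcs)
```

The two D-to-B call sites collapse into a single undirected edge of weight 1000, and the recursive call is discarded.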
2.2.2 Sorting by Invocation Frequency
Figure 2: Dynamic call graph
Because our algorithm is based on a "closest is best" strategy [9], it first chooses the edge with the heaviest weight in the graph. If multiple edges have the same weight, one is arbitrarily chosen. The two methods corresponding to the two nodes connected by the chosen edge will be placed next to each other in the final positions of the code. Then, the two nodes in the graph are merged into one node. If these nodes both have edges to the same node, those edges are also merged into one edge whose weight is the sum of the weights of the merged edges. When the nodes are merged, a new node representing the merge is attached to a binary tree (Figure 3). The new node records the weight of the edge between the original nodes, and the original nodes become its child nodes. This process is repeated until the whole graph contains no edges.

Because every node in the dynamic call graph can be traced from the node representing the "main" method, which was invoked first by the JavaVM, the graph is connected. Therefore, the graph will contain only a single node at the final step.
Figure 3: Binary tree
Finally, the positions of the methods are determined by a depth-first search that traces the heavier child node first. The algorithm treats each leaf representing a method as lighter than any node representing a merge. If both child nodes represent methods, one is arbitrarily chosen.

Figures 2 and 3 show an example of a dynamic call graph and the corresponding binary tree, respectively. The edge connecting methods B and D has the heaviest weight in graph (1) of Figure 2, so methods B and D will be placed next to each other in the final positions of the methods. Then, a node representing this merge, with a weight of 1000, is attached to the binary tree, as shown in (a) of Figure 3, and the nodes representing methods B and D become its children. This process is repeated until the graph has no edges.
After inserting the merge nodes, the binary tree is traced by a depth-first search. The final positions of the methods become B-D-A-E-C in this example.
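The whole positioning pass, heaviest-edge merging into a binary tree followed by a heavier-child-first depth-first search, can be sketched end to end. The edge weights below are hypothetical (only the B-D weight of 1000 is stated in the text), chosen so the sketch reproduces the B-D-A-E-C ordering; ties are broken by dictionary order here rather than truly arbitrarily.

```python
def reposition(edges):
    """'Closest is best' positioning (after Pettis and Hansen):
    repeatedly merge the endpoints of the heaviest edge into one graph
    node while recording the merges in a binary tree, then emit the
    method order by a depth-first search visiting the heavier child
    first."""
    # Tree nodes: ('leaf', method) or ('merge', edge_weight, left, right).
    node, adj = {}, {}
    for (a, b), w in edges.items():
        for x in (a, b):
            node.setdefault(x, ('leaf', x))
            adj.setdefault(x, {})
        adj[a][b] = adj[a].get(b, 0) + w
        adj[b][a] = adj[b].get(a, 0) + w

    while any(adj[u] for u in adj):
        # Heaviest remaining edge (ties broken by dict order).
        u = max(adj, key=lambda n: max(adj[n].values(), default=-1))
        v = max(adj[u], key=adj[u].get)
        merged = ('merge', adj[u][v], node[u], node[v])
        neighbours = adj.pop(v)
        del node[v]
        adj[u].pop(v, None)
        for n, w in neighbours.items():
            if n == u:
                continue
            # Edges to a common neighbour are merged; weights summed.
            adj[n].pop(v)
            adj[u][n] = adj[u].get(n, 0) + w
            adj[n][u] = adj[u][n]
        node[u] = merged

    order = []
    def weight(t):          # leaves count as lighter than any merge node
        return t[1] if t[0] == 'merge' else -1
    def dfs(t):
        if t[0] == 'leaf':
            order.append(t[1])
        else:
            a, b = t[2], t[3]
            heavy, light = (a, b) if weight(a) >= weight(b) else (b, a)
            dfs(heavy)
            dfs(light)
    (root,) = node.values()  # a connected graph collapses to one node
    dfs(root)
    return order

# Hypothetical profile counts; only B-D = 1000 is given in the paper.
edges = {('B', 'D'): 1000, ('A', 'E'): 400, ('A', 'B'): 200,
         ('B', 'E'): 100, ('C', 'E'): 1}
```

With these weights the merges are (B,D), (A,E), ((B,D),(A,E)), and finally C, and the depth-first search yields the B-D-A-E-C layout discussed in the text.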
2.2.3 Repositioning
In a typical application, many methods are invoked only a few times. To reduce the cost of dynamic code repositioning, the edges for these infrequent calls are removed from the set of target methods: the control thread removes methods whose ratios of invocations to the total number of invocations are below a threshold after calculating the positions of the methods. Finally, the system recompiles the methods in the sorted order and inserts the native code.
2.3 Method splitting
Both frequently executed code and rarely executed code often exist within one method. It is thus more efficient to put frequently executed code close to a caller method and rarely executed code farther away. The system treats two types of code, exception code and inline compensating code, as rarely executed, and it puts them in a rare-code region, which is separate from the usual region in which frequently executed code is placed.
2.3.1 Exception Code Splitting
Exception mechanisms characterize Java. Exception code contains the control flows that create and throw exception objects, and the control flows that catch exception objects and return to the original sites after the exceptions are processed. These portions of code are executed relatively rarely. The system analyzes the control flows of methods and finds those flows that process exceptions and create exception handlers. The system treats the basic blocks of these control flows as rarely executed code.
2.3.2 Inline Compensating Code Splitting
The IBM JIT optimizes code by inline method expansion. While the application code is running, it looks for non-overridden methods among the compiled methods by analyzing the class hierarchy. It performs inline method expansion for the methods found and creates compensating code. The compensating code performs the original virtual method call and then returns to the original control flow.

If an inlined method is later overridden, the JIT replaces the expanded inline code with a call to the compensating code. With this algorithm, the JIT can optimize virtual method calls and still ensure correct behavior.

The system treats the compensating code as rarely executed.
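The splitting policy of this section amounts to partitioning a method's basic blocks into a hot region and a rare-code region. A minimal sketch, with block attributes invented for illustration (the prototype actually works on the JIT's internal representation):

```python
# Sketch of method splitting: exception-handling blocks and inline
# compensating code go to a separate rare-code region. The block
# attribute names here are assumptions, not the JIT's real metadata.
def split_method(blocks):
    """Partition basic blocks into (hot, rare) name lists."""
    hot, rare = [], []
    for b in blocks:
        if b.get('throws_exception') or b.get('exception_handler') \
                or b.get('inline_compensation'):
            rare.append(b['name'])
        else:
            hot.append(b['name'])
    return hot, rare

blocks = [
    {'name': 'entry'},
    {'name': 'fast_path'},
    {'name': 'throw_oob', 'throws_exception': True},
    {'name': 'devirt_fallback', 'inline_compensation': True},
]
hot, rare = split_method(blocks)
```

Only the entry and fast-path blocks stay in the hot region; the throw site and the devirtualization fallback are moved out of the frequently executed pages.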
3 Evaluation
This section gives the results of an experimental evaluation of a prototype implementation of the code repositioning technique. The prototype implementation was based on IBM JDK 1.3.0 [11]. The primary goals of our evaluation were to determine whether code repositioning could achieve a significant improvement in performance, and to analyze the system's behavior at the processor level.
3.1 Environment
All the experiments were performed on a 1 GHz Pentium III system [4] with 512 MB of RAM, running Red Hat Linux 7.2, kernel 2.4.9-13. The benchmarks we used were SPEC jbb2000 [5] and Call Control Simulator. SPEC jbb2000 is a benchmark designed to evaluate server performance by emulating a 3-tier middleware system. It outputs a score for comparing the performance of JavaVMs and underlying platforms. Call Control Simulator is our original benchmark, which emulates a JAIN Call Control [3] platform running on the JBoss application server [7]. It requires a client that generates an input sequence, and it outputs its throughput every second.

Table 1 summarizes the key characteristics of the Pentium III processor. The processor has two L1 caches, one for code and one for data; one L2 cache shared by both; and two TLBs, an I-TLB for code and a D-TLB for data. The technique discussed in this paper affects the behavior of the L1 code cache, the L2 cache, and the I-TLB.
3.2 Benchmark Results
To maximize the performance improvement, the profiles for code repositioning must be gathered during the most appropriate stage of the application's life cycle. To fulfill this requirement, we made the input sequence of Call Control Simulator continuous through the start-up and warm-up periods, arranged the thresholds for starting and stopping the profiling, and confirmed that the profiling finished by the end of the warm-up period.
In the case of generic benchmarks like SPEC jbb2000, it is not possible to arrange the input sequence, so we only set the thresholds. Therefore, its improvement may be relatively small.
The effectiveness of the code repositioning technique depends on the size of the application's code working set.

Table 1: Specification of the Pentium III
  L1 code cache   16 KB, 4-way, 32 B cache lines
  L2 cache        256 KB, 4-way, 32 B cache lines
  I-TLB           32 entries, 4-way

Table 2 lists the benchmark specifications: the size of the compiled code and the number of repositioned methods. The former is the size of the native code compiled from frequently invoked methods, and the latter is the number of methods that were invoked frequently while the profiles were gathered and that were repositioned by recompiling. The size of the compiled code differs from the size of the application's code working set, because the JIT compiles only methods that are invoked many times; however, the relation between these two sizes is similar across the benchmark applications.
Table 3 shows the results for Call Control Simulator. "Cycles instruction fetch stalled" means the number of cycles during which the processor stalls because it cannot fetch the next instruction. The throughput improved by 5.0%, the cycles instruction fetch stalled were reduced by 9.4%, and the I-TLB hit rate was significantly improved. On the other hand, the L1 cache hit rate and the L2 cache hit rate were not improved after repositioning the code.
Table 4 shows the results for SPEC jbb2000. The score improved by 1.2%, the cycles instruction fetch stalled were reduced by 9.8%, and the I-TLB hit rate improved by 10.2%. On the other hand, the L1 cache hit rate and the L2 cache hit rate were not improved, as in the results for Call Control Simulator.
According to these results, the major gains from the code repositioning technique derive from more efficient use of memory pages, not from more efficient use of the caches. The main reasons for the lack of effect on cache usage are as follows.

• The cache lines are small relative to the methods. Because methods are much bigger than a cache line, a single cache line normally contains code from only one method. Therefore, neither the code repositioning nor the method splitting can improve the efficiency of cache usage.

• The application's code working set is small relative to the size of each cache.
3.3 Repositioning Analysis
We observed the frequency of method invocation to verify that the code repositioning algorithm could put frequently used portions of code closer together. Figure 4 shows the distribution of method invocations for Call Control Simulator before code repositioning, while Figure 5 shows the distribution after code repositioning. In each figure, the x-axis represents the head address of the memory pages, and the y-axis represents the number of
Table 2: Application size
                           Size of compiled code (bytes)   Repositioned methods
  SPEC jbb2000             442,140                         79
  Call Control Simulator   1,940,012                       320
Table 3: Results for Call Control Simulator
                                     Default       Repositioned   Improvement
  Throughput                         65.5          68.8           5.0%
  Cycles instruction fetch stalled   2.12×10^10    1.92×10^10     9.4%
  L1 cache hit rate                  97.8%         97.8%          0.0%
  L2 cache hit rate                  70.9%         71.3%          0.4%
  I-TLB hit rate                     79.7%         85.4%          5.7%
Table 4: Results for SPEC jbb2000
                                     Default       Repositioned   Improvement
  Score                              10,113        10,238         1.2%
  Cycles instruction fetch stalled   1.14×10^10    1.03×10^10     9.8%
  L1 cache hit rate                  98.5%         98.6%          0.1%
  L2 cache hit rate                  90.2%         90.6%          0.4%
  I-TLB hit rate                     82.5%         92.6%          10.2%
Figure 4: Distribution of method invocation before coderepositioning
Figure 5: Distribution of method invocation after coderepositioning
! "$# &% ! ' # )(* + , ! ! ! - )! ".# /% ! '0(* + , ! ! !
1 32445% ! ' # )(* + , ! ! ! 1 326440% ! '0(* + , ! ! !
Figure 6: Coverage of application footprint
invocations of the methods in a page. The distribution before the code repositioning is spread over many pages, while the distribution after the code repositioning shows that frequently invoked methods were placed in a smaller number of pages, and most method invocations were confined to the methods in those pages.
Figure 6 shows the coverage rate for each application before and after code repositioning in terms of the number of pages. Because Call Control Simulator was the largest application, more pages were required to cover its footprint. Before code repositioning, 55% of invocations could be covered in 32 pages, which is the number of I-TLB entries available on a Pentium III. After code repositioning, on the other hand, 79% of invocations could be covered in 32 pages. The smallest application, jbb2000, covered 94% in 32 pages before code repositioning and 99% after code repositioning.
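The page-coverage figures above can be computed from per-page invocation counts as in the sketch below; the distribution used here is invented, not the measured data.

```python
# Sketch: fraction of all method invocations that fall in the N
# most-invoked pages (N = 32 matches the Pentium III I-TLB entries).
def coverage(page_counts, n_pages=32):
    top = sorted(page_counts.values(), reverse=True)[:n_pages]
    return sum(top) / sum(page_counts.values())

# Invented distribution: 40 pages, with invocations concentrated in
# the first eight pages, as repositioned code would produce.
counts = {page: 1000 if page < 8 else 10 for page in range(40)}
```

With this invented distribution, the 32 hottest pages capture 8240 of the 8320 total invocations, about 99% coverage.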
4 Conclusion
This paper has described and evaluated a code repositioning technique for a JavaVM that improves the performance of a large-scale application server. With this approach, the system profiles the application's behavior during the most appropriate stage of its life cycle. It then repositions the compiled code to minimize the memory footprint of the application's working set of instructions.

The results of our evaluation demonstrated performance improvements of 5.0% for Call Control Simulator and 1.2% for SPEC jbb2000. The improvement derives mainly from more efficient use of memory pages and the TLB.

Our future plan is to combine this code repositioning approach with the inline method expansion technique.
References
[1] M. Arnold, M. Hind, and B. G. Ryder, "Online feedback-directed optimization of Java," Proceedings of the 17th ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 111–129, ACM Press, 2002.
[2] R. G. Burger and R. K. Dybvig, "An infrastructure for profile-driven dynamic recompilation," International Conference on Computer Languages, pp. 240–251, 1998.
[3] JAIN Community, http://java.sun.com/products/jain/.
[4] Intel Corporation, "The Intel Xeon processor manuals." http://developer.intel.com/design/Xeon/manuals/.
[5] The Standard Performance Evaluation Corporation, "SPEC jbb2000 benchmarks." http://www.spec.org/osg/jbb2000/.
[6] K. Ishizaki, M. Kawahito, T. Yasue, H. Komatsu, and T. Nakatani, "A study of devirtualization techniques for a Java just-in-time compiler," Proceedings of the Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 294–310, ACM Press, 2000.
[7] JBoss, http://www.jboss.org/.
[8] M. Paleczny, C. Vick, and C. Click, "The Java HotSpot(TM) server compiler," USENIX Java(TM) Virtual Machine Research and Technology Symposium, April 2001.
[9] K. Pettis and R. C. Hansen, "Profile guided code positioning," Proceedings of the Conference on Programming Language Design and Implementation, pp. 16–27, ACM Press, 1990.
[10] E. Sirer, A. Gregory, and B. Bershad, "A practical approach for improving startup latency in Java applications," In Workshop on Compiler Support for Systems Software, INRIA Technical Report 0228, pp. 47–55, 1999.
[11] T. Suganuma, T. Yasue, M. Kawahito, H. Komatsu, and T. Nakatani, "A dynamic optimization framework for a Java just-in-time compiler," Proceedings of the OOPSLA '01 Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 180–195, ACM Press, 2001.
[12] A. Tamches and B. P. Miller, "Fine-grained dynamic instrumentation of commodity operating system kernels," Operating Systems Design and Implementation, pp. 117–130, 1999.
Session 2: Numerical Computations
Designing polylibraries to speed up linear algebra
computations∗†
Pedro Alberti, Pedro Alonso, Antonio Vidal
Departamento de Sistemas Informáticos y Computación.
Universidad Politécnica de Valencia.
Cno. Vera s/n, 46022 Valencia, Spain.
palberti, palonso, [email protected]
Javier Cuenca
Departamento de Ingeniería y Tecnología de Computadores.
Universidad de Murcia. Aptdo 4021. 30001 Murcia, Spain.

Domingo Giménez
Departamento de Informática y Sistemas.
Universidad de Murcia. Aptdo 4021. 30001 Murcia, Spain.
June 27, 2003
Abstract
In this paper we analyze the design of polylibraries, in which programs call routines from different libraries according to the characteristics of the problem and of the system used to solve it. An architecture for this type of library is proposed. Our aim is to develop a methodology which can be used in the design of parallel libraries. To evaluate the viability of the proposed method, the typical linear algebra library hierarchy has been considered. Experiments have been performed on different systems and with linear algebra routines from different levels of the hierarchy. The results confirm the design of polylibraries as a good technique for speeding up computations.

Keywords: linear algebra computation, polylibraries, automatic optimization, parallel linear algebra hierarchy, linear algebra modelling.
∗This work has been funded in part by Comisión Interministerial de Ciencia y Tecnología, project number TIC2000/1683-C03, and Fundación Séneca, project number PI-34/00788/FS/01.

†This work has been developed using the systems of the CEPBA (European Center for Parallelism of Barcelona), the CESCA (Centre de Supercomputació de Catalunya), and the Innovative Computing Laboratory at the University of Tennessee.
1 Introduction
The idea we put forward in this paper is to use a number of different basiclibraries to develop a polylibrary. To evaluate the method, the typical parallellinear algebra libraries hierarchy has been considered.
Linear algebra functions are widely used in the solution of scientific and engineering problems, which means that there is much interest in the development of highly efficient linear algebra libraries, and these libraries are used by non-experts in parallel programming. In recent years several groups have been working on the design of highly efficient parallel linear algebra libraries. These groups work in different ways: optimizing the library in the installation process for shared memory machines [1] or for message-passing systems [2, 3], analyzing the adaptation of the routines to the conditions of the system at a particular moment [4, 5, 7], or using polyalgorithms [8, 9]. Polylibraries can be used in combination with other techniques to speed up linear algebra computations, in such a way that a user will not need to decide the best library to use, and the polylibrary is responsible for obtaining a near optimum execution.
In Section 2, an architecture for a polylibrary is proposed, and the ensuing variations in the library hierarchy are considered. Section 3 shows some of the experimental results obtained. Finally, in Section 4, some conclusions are given.
2 The Design of Polylibraries
When solving linear algebra problems, basic linear algebra routines are called to solve different parts of the problem. Typically some version of BLAS [10] or LAPACK [11] is used in sequential or shared memory systems, and ScaLAPACK [12] can be used in message-passing systems. But there may be different versions of the libraries. It is possible to have libraries optimized for the machine (either provided by the vendor or freely available [13, 14]), or automatic tuning libraries like ATLAS [1] can be installed, or again alternative libraries can be used [15].
The hierarchy of linear algebra libraries is shown in figure 1. At the lowest level of the hierarchy is the basic linear algebra library (BLAS), which may be the reference BLAS, a BLAS optimized for the machine (provided by the vendor or freely available), or an automatic tuning library. To perform the communications in message passing programs a library optimized for the machine can be used, as could a free version of MPI [16] or PVM [17]. At higher levels of the hierarchy there are also different libraries with the same or similar functionality. To solve linear systems or eigenvalue problems in sequential processors, a reference or machine optimized LAPACK can be used. To solve the problem in a message-passing platform, different versions of ScaLAPACK or PLAPACK can be used.
By using libraries from the lower or the middle levels of the hierarchy, we can develop libraries to solve problems in different fields (libraries to solve ODE problems, inverse eigenvalue problems, least square Toeplitz problems, etc.).
For example, it is possible to solve an eigenvalue problem using ScaLAPACK,
Figure 1: Hierarchy of linear algebra libraries. (Highest level: ODE library, eigenvalue problem library, least square Toeplitz library. Middle level: reference LAPACK, machine LAPACK, ESSL; reference ScaLAPACK, machine ScaLAPACK, PLAPACK; PBLAS; BLACS. Lowest level: reference BLAS, machine BLAS, ATLAS; machine MPI, MPICH, LAM, PVM.)
LAPACK, ATLAS and machine MPI; alternatively, it could be solved using PLAPACK, machine BLAS and MPICH. Likewise, the same problem, with different sizes, may require distinct libraries.
2.1 The Advantage of Polylibraries
Traditionally the user decides to use a particular set of libraries in the solution of all the problems. There are some considerations which make it advisable to use different libraries for different problems. Automatic selection is therefore interesting for a number of reasons:
• A library optimized for the system might not be available. The use of an automatic tuning library would be satisfactory, although with time a better library may become available. Thus, it is necessary to have a common method to evaluate libraries or routines in the libraries, so that when a new library is installed, the preferred library can be changed.
• When the characteristics of the system change, the preferred library can also change. This makes it necessary to reinstall the libraries.
• Which library is the best may not be clear, since library preference will vary according to the routines and the systems in question. This makes it necessary to develop the concept of polylibrary. The information generated in the installation of each library is used to construct the polylibrary.
• Even for different problem sizes or different data access schemes the preferred library can change. This makes it necessary to design polylibraries in such a way that a routine in the polylibrary calls one library or another not only according to the routine, but also according to other factors such as the problem size or the data access scheme. Hence, the routines in the polylibrary act as interfaces between the program and the routines of the different libraries.
• We may be working in a parallel system with the file system shared by processors of different types, which may or may not share the libraries. If they share a basic library, it may be optimized for one type of processor and not for the others. In any case, the resulting library will be different for the different types of processors. In parallel systems the polylibrary will be different in processors that do not share the basic libraries, while in processors sharing the basic libraries the polylibrary could be the same, but with the interface routines including the possibility of calling one library or another according to the processor type.
In this paper, we propose the architecture of a polylibrary, and show with experiments with some linear algebra routines how the use of polylibraries helps in the decision of the libraries to be used to reduce the execution time when solving a problem.
2.2 Architecture of a Polylibrary
A possible architecture for a polylibrary is shown in figure 2. For each of the basic libraries an installation file (LIF) will be generated, containing information about the performance of the different routines in the library for different variable factors. The format of the different LIFs must be the same, and the generation of the LIFs must be carried out in the polylibrary installation or updating process. This generation can be made by using some model of the routines to predict the execution time or by performing some executions [18, 2]. When the generation is made by performing some executions, the process must be guided in order to avoid an overlong installation [19]. For each routine in the polylibrary, an interface routine is developed.
The polylibrary will use the information generated in the lower level libraries installation (LIFs) to generate the code of the interface routine. To do this, the format of the different LIFs should be the same, and what we propose is to develop each library together with an installation engine which generates the LIF in the required format. This format must be decided by the linear algebra community in order to facilitate the use of the information to design polylibraries or libraries of higher level in the libraries hierarchy. The information in the LIF could be the cost of an arithmetic operation using some routines. In some cases
Figure 2: Architecture of a polylibrary. (The installation process of each library generates its LIF; the polylibrary consists of interface routines which, depending on the problem size n and the data storage, call the corresponding routine from one library or another.)
this could be obtained for different problem sizes, because the use of different costs for different sizes can produce a better prediction and consequently a reduction in the execution time [20].
For some routines more than one parameter can be used, as for example in matrix-matrix multiplication, where it could be interesting to store the cost with square or rectangular matrices. Also the different access schemes can produce different costs; for example, the BLAS routine drot can have different costs when the data in the vectors are stored in adjacent positions of memory than when they are stored in non-adjacent positions. The format of the LIF could be as follows:
drot                  dgemm
  ld 1, ld n            column, row
  100   20              100   20
  200   40              200   40
  400   80
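A LIF of this kind can be held in memory as a small cost table indexed by routine, access scheme and problem size, with the interface routine looking up the entry for the nearest installed size. The following Python sketch is our own illustration: the dictionary layout, the `nearest_cost` helper and the cost units are assumptions, not a format fixed by the paper.

```python
# Hypothetical in-memory form of a LIF: measured costs (arbitrary units)
# indexed by (routine, access scheme), then by problem size.
LIF = {
    ("drot", "ld 1"): {100: 20, 200: 40, 400: 80},
    ("dgemm", "column"): {100: 20, 200: 40},
}

def nearest_cost(lif, routine, scheme, n):
    """Return the cost recorded for the installed size closest to n."""
    table = lif[(routine, scheme)]
    size = min(table, key=lambda s: abs(s - n))
    return table[size]

print(nearest_cost(LIF, "drot", "ld 1", 380))
```

An interface routine would compare such costs across the LIFs of the installed libraries and call the routine of the cheapest one.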
The development of installation engines must be included not only in the basic libraries (BLAS, ATLAS, MPI, PVM, ...), but also in libraries of higher level (LAPACK, PLAPACK, ScaLAPACK, ...). This allows the development of higher level libraries, which could include an installation engine to generate their own LIFs.
2.3 Combining Polylibraries with other Optimisation Techniques
There are different techniques for obtaining optimized libraries, and better libraries can be obtained by combining different techniques.
In polyalgorithms we have different algorithms to solve the same problem. The algorithms can be implemented in different libraries, and the use of different algorithms does not differ from the use of different libraries. If one of the basic libraries used is an automatically tuned library, a large amount of time may be necessary to install the library in the system [1, 2], which means a large amount of time is necessary to update the polylibrary when the conditions of the system change. The same is true when installing or reinstalling the polylibrary, and in a parallel system multiple installations will be necessary (possibly in parallel). Thus the installation process must be guided by the system manager [3], who decides when it is convenient to reinstall the polylibrary or include a new library, or a new type of processor, and which factors and range of values to consider.
In some cases the values of some algorithmic parameters are obtained at execution time [20]. These values can be the block size in algorithms by blocks, the number of processors, or some parameters determining the logical topology in a parallel system. This means different values of these parameters must be considered in the polylibrary installation. The routine interface in the polylibrary must contemplate this possibility, and the same routine can be called with different values of the parameters for different problem sizes.
It could be interesting to have a system which attends to requests from the user and provides a suitable set of resources according to the present state of the system. Some research is being done in this direction [5, 6]. To adapt the method we propose to systems where the load varies dynamically, some heuristic can be used [7]. The values obtained at installation time can be adjusted by using some measures of the system load obtained at running time.
3 Experimental Results
Experiments have been carried out with routines of different levels in the hierarchy:
• In the lowest level of the hierarchy, the routines are the typical BLAS ones. These routines must be installed by performing selected computations, and the information is stored in the LIF of the library. The matrix-matrix multiplication has been used because it is a representative operation of the lowest level libraries, it is used in algorithms by blocks, and it is a good example to show the methodology. In addition, versions by blocks and the Strassen method [21] have been considered, taking into account the combination of the design of polylibraries and polyalgorithms.
• Matrix factorizations by blocks [22] have been analyzed (LU and QR) because they are basic operations at a second level of libraries, and use routines of a lower level, as for example the matrix-matrix multiplication.
• As examples of the highest level of libraries, a lift-and-project algorithm to solve the inverse additive eigenvalue problem [23] and an algorithm to solve the Toeplitz least square problem [24, 25] have been considered. In the former, the cost is of order O(n^4) and it is easy to make a good choice of the parameters to obtain executions close to the optimum. The second algorithm has order O(n^2), and it is more difficult to make a good selection of the parameters. In both cases they use routines of lower levels, and those routines are used in a black-box way.
The experiments have been carried out over different systems because the goal is to obtain a methodology valid for a wide range of systems. The platforms used have been a shared memory system (Origin 2000 with 32 processors) and some networks of processors.
Some of the experimental results obtained are reported in this section. More experimental results can be found in [26].
3.1 Matrix-matrix Multiplication
The matrix-matrix multiplication is an operation from the lowest level of the hierarchy, but higher level implementations have been considered. For polyalgorithms we use a direct multiplication, a multiplication by blocks, and a Strassen multiplication. The use of these algorithms allows us to combine the design of polylibraries with the automatic choice of algorithmic parameters: in the algorithm by blocks the block size is an algorithmic parameter, and in the Strassen multiplication the algorithmic parameter is the size of the matrix at which to stop the recurrence.
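As an illustration of the polyalgorithmic ingredient, the Strassen multiplication with its stopping parameter can be sketched in Python with NumPy. This is a didactic version, assuming square matrices whose size is the base size times a power of two; in the setting of the paper the direct multiplication below the base size would be a call to the selected BLAS.

```python
import numpy as np

def strassen(A, B, base):
    """Strassen multiplication; at or below size `base` fall back to a
    direct multiplication (here NumPy's `@`, standing in for BLAS)."""
    n = A.shape[0]
    if n <= base:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, base)
    M2 = strassen(A21 + A22, B11, base)
    M3 = strassen(A11, B12 - B22, base)
    M4 = strassen(A22, B21 - B11, base)
    M5 = strassen(A11 + A12, B22, base)
    M6 = strassen(A21 - A11, B11 + B12, base)
    M7 = strassen(A12 - A22, B21 + B22, base)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7   # C11
    C[:h, h:] = M3 + M5             # C12
    C[h:, :h] = M2 + M4             # C21
    C[h:, h:] = M1 - M2 + M3 + M6   # C22
    return C
```

The `base` argument is exactly the algorithmic parameter discussed above: a small base size does more Strassen levels with fewer arithmetic operations but worse locality, so its best value depends on the machine and library.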
For routines in the lowest level of the hierarchy only the sequential case is analyzed, because these routines would be used by parallel routines which call them in each processor.
In a network of six SUNs with two different types of processors (five SUN Ultra 1 and one SUN Ultra 5), the interface of the routine would have one part for each type of processor. The libraries used in the experiments have been: a freely available reference BLAS (BLASref), the scientific library of SUN (BLASmac), and three versions of ATLAS: a version of ATLAS installed in the SUN Ultra 1 (ATLAS1), and another two prebuilt versions of ATLAS 3.2.0 [30] for Ultra 2 (ATLAS2) and Ultra 5 (ATLAS5). The same libraries are used by the SUN Ultra 1 and the SUN Ultra 5 because they share the file system.
The values in the LIF can be obtained by performing executions for the different algorithms, problem sizes and values of the parameters provided by the library manager. This may suppose a large amount of time, and heuristics can be used to reduce that time. For example, in the algorithm by blocks, experiments can be performed for different blocks for the smallest problem size. In the Strassen algorithm, for the smallest problem size, the problem is solved with base size half the problem size, then a quarter of the problem size, and so on until the execution time increases. The optimum base size is taken as the initial value for the local search for the next problem size.
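The installation heuristic just described can be sketched as follows. Here `time_of(n, base)` stands for an actual timing experiment, and the structure (halving the base size while the measured time decreases, restarting the search from the previous optimum) is our simplified reading of the text, not code from the paper.

```python
def best_base_size(n, time_of, start=None):
    """Halve the base size starting from `start` (default n // 2) while
    the measured time keeps decreasing; return the last improving size."""
    base = start if start is not None else n // 2
    best, best_t = base, time_of(n, base)
    while base > 1:
        base //= 2
        t = time_of(n, base)
        if t >= best_t:
            break
        best, best_t = base, t
    return best

def install(sizes, time_of):
    """Build the LIF entries: full descent for the smallest size, then
    start each search from the previous optimum."""
    lif, prev = {}, None
    for n in sorted(sizes):
        prev = best_base_size(n, time_of, start=prev)
        lif[n] = prev
    return lif
```

With a real timing function the loop stops as soon as halving the base size no longer pays off, so only a few executions per problem size are needed.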
If the installation is made with matrix sizes 400, 800 and 1200, the interface routine could be:

if processor is SUN Ultra 5
    if problem-size < 600
        solve using ATLAS5 and Strassen method with base size half of problem size
    else if problem-size < 1000
        solve using ATLAS5 and block method with block size 400
    else
        solve using ATLAS5 and Strassen method with base size half of problem size
    endif
else if processor is SUN Ultra 1
    if problem-size < 600
        solve using ATLAS5 and direct method
    else if problem-size < 1000
        solve using ATLAS5 and Strassen method with base size half of problem size
    else
        solve using ATLAS5 and direct method
    endif
endif
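The same decision logic can be written as a dispatch table generated at installation time. This Python sketch mirrors the thresholds of the example; the `select` helper and the rule layout are our own illustration.

```python
# Decision rules generated at installation time, one list per processor
# type: (size limit, library, method, parameter), tried in order.
RULES = {
    "SUN Ultra 5": [
        (600,          "ATLAS5", "strassen", "n/2"),
        (1000,         "ATLAS5", "blocks",   400),
        (float("inf"), "ATLAS5", "strassen", "n/2"),
    ],
    "SUN Ultra 1": [
        (600,          "ATLAS5", "direct",   None),
        (1000,         "ATLAS5", "strassen", "n/2"),
        (float("inf"), "ATLAS5", "direct",   None),
    ],
}

def select(processor, n):
    """Return the (library, method, parameter) for this processor and size."""
    for limit, library, method, parameter in RULES[processor]:
        if n < limit:
            return library, method, parameter
    raise ValueError("no rule matched")
```

Keeping the rules as data rather than code makes it easy for the installation process to regenerate them when a library is added or the system changes.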
The lowest execution time (low.) is compared with the execution time with the library, method and parameter provided by the model (mod.), and with that obtained using the direct method with ATLAS5. The comparison is shown in table 1. The execution time using the model is close to the lowest, and the same happens when the direct method and ATLAS5 are used. The conclusion could be that it is better to determine the best library (ATLAS5) and method (direct). Possibly this is a good option in some cases, but we observe that the optimum library is not always apparent. The preferred library should be the scientific library of SUN, but it is clearly surpassed by all the versions of ATLAS. In the SUN Ultra 1, one would expect the preferred ATLAS to be that installed in the system by the manager (ATLAS1), but in fact the best is ATLAS5.
The selection of the library, method and parameters is satisfactory in all the systems considered, and this produces small differences between the lowest execution time and that obtained with the values provided by the polylibrary.
The use of different algorithms and libraries is not a great improvement with respect to the use of the best library, but the proposed method can also be used just to obtain or update which is the best library among those available.
The method can also be used to decide which algorithm to choose from a set of available algorithms, because each algorithm can be considered as the routine of a different library.
The selection of the parameters which produce the lowest execution times can be combined with the selection of the library or algorithm.
Table 1: Comparison of the lowest execution time, the execution time obtained using the model, and the execution time using the direct method and ATLAS5. In SUN Ultra 5.

size                  200       600       1000      1400      1600
low.   time           0.038     1.063     4.678     12.532    20.031
       library        ATLAS5    ATLAS5    ATLAS5    ATLAS2    ATLAS2
       method         direct    direct    Strassen  Strassen  blocks
       parameter                          2         2         400
mod.   time           0.041     1.112     4.678     12.575    26.569
       library        ATLAS5    ATLAS5    ATLAS5    ATLAS5    ATLAS5
       method         Strassen  blocks    Strassen  Strassen  Strassen
       parameter      2         400       2         2         2
ATLAS5 direct, time   0.038     1.063     4.833     13.501    31.023
3.2 Matrix factorizations
Automatic tuning medium level routines can be developed by obtaining a theoretical model of the routine where the system parameters are identified, and the values of the system parameters are obtained from those stored in the LIFs of lower level libraries. After the substitution of the values some decisions can be taken, e.g. the best library and the optimum values of some algorithmic parameters. We show how the method works with the LU and QR factorizations.
One of the most difficult parts of developing a hierarchy like the one proposed is to obtain a common cost model valid for a wide range of systems and algorithms. This is not the goal of this paper, and there are some complementary works on this issue [27, 3, 28, 29]. Some of the ideas to consider when obtaining a simplified but useful model are: the parameters in the theoretical model are taken not as constant values, but as functions of the problem size and some parameters which depend on the algorithm (the block size, the number of processors, the topology of the communication network, etc.); different parameters are used for different operations at the same computational level; and the values of the parameters also depend on the way the data are stored and accessed.
With these considerations, the cost of the parallel block LU factorization can be modelled by the formulae:

\frac{2}{3} k_{3,dgemm} \frac{n^3}{p} + \frac{r+c}{p} b \, k_{3,dtrsm} n^2 + \frac{1}{3} b^2 k_{2,dgetf2} \, n \qquad (1)

for the arithmetic cost, and:

t_s \frac{2 n \max(r,c)}{b} + t_w \frac{2 n^2 \max(r,c)}{p} \qquad (2)

for the communication cost. And for the QR factorization the costs are modelled by:

\frac{4}{3} k_{3,dgemm} \frac{n^3}{p} + \frac{1}{4} \frac{c}{b} k_{3,dtrmm} n^2 + \frac{1}{r} b \, k_{2,dgeqr2} n^2 + \frac{1}{2} \frac{r}{b} k_{2,dlarft} n^2 \qquad (3)

and:

t_s \frac{n}{b} \left( (2+3b) \log r + 2 \log c \right) + t_w \left( \frac{n^2}{2} \left( \frac{2(r-1)}{r^2} + \frac{\log r}{c} + \frac{\log r}{r} \right) + n b \log p \right) \qquad (4)
The matrix is distributed in a logical two-dimensional mesh (in ScaLAPACK style [12]) of p = r × c processors, with block size b.
The system parameters are k3 (operations of level 3 BLAS), k2 (operations of level 2 BLAS), ts (start-up time) and tw (word-sending time); and the algorithmic parameters are b (block size) and r and c (number of rows and columns in the mesh of processors). The cost of the arithmetic operations is considered different for different routines (k3,dgemm, k3,dtrsm, k2,dgetf2, k3,dtrmm, k2,dgeqr2 and k2,dlarft). The two factorizations have in common the parameter k3,dgemm. Hence, they will use the information (LIF) generated for k3,dgemm when the matrix-matrix multiplication was installed in the system. The library (l) can also be considered as an algorithmic parameter. Thus we have k3,dgemm(n, b, r, c, l), and the same with the other system parameters. In the installation of a routine like dgemm, the dependence of the system parameters with respect to the algorithmic parameters is analyzed, and possibly the number of dependencies will be reduced: the dependence may be k3,dgemm(b, l).
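Parameter selection then amounts to evaluating the model for every candidate library, mesh and block size, with the k values read from the LIFs, and keeping the minimum. The sketch below does this for the LU model, formulae (1) plus (2); the k values in the test are made-up placeholders, not measured data.

```python
def lu_time(n, b, r, c, k, ts, tw):
    """Modelled LU time: arithmetic cost (formula (1)) plus
    communication cost (formula (2))."""
    p = r * c
    arith = (2 / 3) * k["dgemm"] * n ** 3 / p \
        + (r + c) / p * b * k["dtrsm"] * n ** 2 \
        + (1 / 3) * b ** 2 * k["dgetf2"] * n
    comm = ts * 2 * n * max(r, c) / b + tw * 2 * n ** 2 * max(r, c) / p
    return arith + comm

def choose(n, libraries, meshes, blocks, ts, tw):
    """Return the (library, b, r, c) minimizing the modelled time.
    `libraries` maps a library name to its k values from the LIFs."""
    candidates = ((lib, b, r, c)
                  for lib in libraries
                  for r, c in meshes
                  for b in blocks)
    return min(candidates,
               key=lambda t: lu_time(n, t[1], t[2], t[3],
                                     libraries[t[0]], ts, tw))
```

Because the candidate sets are small (a few libraries, meshes and block sizes), the exhaustive minimization is cheap compared with a single execution of the routine.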
Experiments have been carried out in a network of Pentium III with Myrinet and Fast-Ethernet. Five libraries have been used: ATLAS, reference BLAS, a BLAS for Pentium II Xeon (BLASII), and two versions of a BLAS for Pentium III (BLASIII). The results obtained with reference BLAS are always worse than those with ATLAS, and the second version of BLASIII always works faster than the first version.
In table 2 the experimental results with the LU factorization are summarized. The theoretical (the.) and lowest (low.) execution times, the execution time obtained with the parameters provided by the model (mod.), and the associated block size are shown for different libraries and problem sizes. The lowest optimum and modelled execution times for each matrix size are highlighted. The model provides good values for the parameters. The best results are obtained with BLASIII and BLASII, as predicted by the model. Also the optimum block size is well predicted. The difference between the libraries is less important in the parallel case than in the sequential case due to the cost of communications, which is common to all the libraries.
In table 3 results obtained with the QR factorization are shown. The execution time with the parameters provided by the model (mod.) and the lowest execution time (low.) are shown together with their parameters. The times with the model are always close to the lowest times, because the model makes a good selection of the parameters: number of processors, dimensions of the mesh, block size and library.
Table 2: Execution time (in seconds) of the parallel LU routine, with different basic libraries. In a network of 4 Pentium III with Myrinet.

ATLAS      the.          low.          mod.
size       time   blo.   time   blo.   time   blo.
512        0.13   32     0.12   32     0.12   32
1024       0.79   32     0.74   32     0.74   32
1536       2.36   32     2.21   64     2.27   32

BLASII     the.          low.          mod.
size       time   blo.   time   blo.   time   blo.
512        0.13   32     0.11   32     0.11   32
1024       0.77   32     0.71   32     0.71   32
1536       2.30   32     2.13   32     2.13   32

BLASIII    the.          low.          mod.
size       time   blo.   time   blo.   time   blo.
512        0.13   32     0.11   32     0.11   32
1024       0.77   32     0.70   32     0.70   32
1536       2.30   32     2.13   32     2.13   32
Table 3: Comparison of the execution time of the QR factorization. In a network of 8 Pentium III with Fast-Ethernet.

size              128     256      384      1024    2048    3072
mod.  time        0.025   0.087    0.178    1.92    9.90    28.00
      processors  1 × 2   1 × 8    1 × 8    1 × 8   1 × 8   1 × 8
      block size  8/16    8        8        16      16      16
      library     BLASII  BLASII   BLASII   BLASII  BLASII  BLASII
low.  time        0.025   0.086    0.176    1.92    9.40    25.11
      processors  1 × 2   1 × 8    1 × 8    1 × 8   1 × 8   1 × 8
      block size  16      8        8        16      16      16
      library     BLASII  BLASIII  BLASIII  BLASII  BLASII  BLASII
3.3 Experiments with a Toeplitz Least Square program
The problem to solve is the Least Square (LS) problem: min_x ||Tx − b||_2, where T ∈ R^{m×n} is Toeplitz (T_ij = t_{i−j}, i = 0, 1, ..., m − 1, j = 0, 1, ..., n − 1) and b ∈ R^m.
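For reference, the dense problem can be stated in a few lines of Python with NumPy. This builds T explicitly and solves with a general least squares routine; it ignores the fast structure-exploiting algorithms of [24, 25] and is only meant to fix the notation.

```python
import numpy as np

def build_toeplitz(t, m, n):
    """T[i, j] = t[i - j]; `t` maps offsets -(n-1)..(m-1) to entries."""
    return np.array([[t[i - j] for j in range(n)] for i in range(m)])

def toeplitz_ls(t, m, n, b):
    """Dense reference solution of min_x ||T x - b||_2."""
    T = build_toeplitz(t, m, n)
    x, *_ = np.linalg.lstsq(T, b, rcond=None)
    return x
```

A fast Toeplitz solver only needs the m + n − 1 generating values t, while this dense version materializes the full m × n matrix.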
Experiments have been carried out in two networks of Pentiums with Myrinet. In a Pentium II network with Myrinet, the execution time in the sequential case (seq.), the lowest experimental time (low.) and the execution time with the number of processors provided by the model (mod.) are shown in table 4, along with the number of processors used and the speed-up. We can see the model does not provide the best number of processors, and the difference between the optimum execution time and the execution time with the number of processors provided by the model is on average about 58%. For algorithms with a small cost it could be preferable to perform some selected executions at installation time to decide the values of the parameters, and to use the model only for bigger problem sizes [18].
Table 4: Comparison of the optimum execution time and the execution time with the parameters provided by the model. Routine LS Toeplitz in a network of Pentium II with Myrinet.

m × n         seq.    low.                     mod.
              time    time   proc.  speed-up   time   proc.  speed-up
1000 × 500    0.868   0.341  16     2.55       0.868  1      1.00
1000 × 1000   2.144   0.988  8      2.17       1.715  2      1.25
2000 × 200    0.600   0.178  16     3.37       0.383  2      1.57
2000 × 1000   3.630   1.068  16     3.40       1.580  4      2.30
2000 × 2000   9.225   2.922  16     3.16       4.151  4      2.22
The library preferred in the sequential case could also be considered as that preferred for the parallel routine. In that way ATLAS would be used in the Pentium III network. In table 5 the execution times are compared when using the different libraries. The lowest execution time is highlighted for each matrix and number of processors. We can see the best library is now BLASII, which is the second best in the sequential case. This may be due to the fact that in parallel we work with smaller vectors in each processor, and the library preferred with the sizes experimented in the sequential case may not be the one preferred for smaller sizes.
3.4 Lift-and-project Method for the Inverse Additive Eigenvalue Problem
A lift-and-project method for the solution of the inverse additive eigenvalue problem has been used as an example of a high level algorithm of high cost. The order of the algorithm is O(n^4), and this high order makes it easier to obtain
Table 5: Comparison of the execution time (in seconds) of the parallel routine with different basic libraries. Routine LS Toeplitz in a network of Pentium III with Myrinet.

          processors:  1        2        3        4
ATLAS     400          0.4566   0.5870   0.5280   0.4151
          800          1.0109   1.4078   1.1537   1.1023
          1200         1.9772   2.6694   2.3173   1.8172
BLASII    400          0.6115   0.5554   0.5088   0.4073
          800          1.2751   1.3887   1.1412   0.9650
          1200         2.4280   2.5838   2.1818   1.7185
BLASIII   400          0.6329   0.5850   0.5180   0.4170
          800          1.3183   1.4244   1.1862   1.0444
          1200         2.5290   2.5745   2.2229   2.0067
optimizations than in the Toeplitz Least Square algorithm considered, which has an order O(n^2).
The problem is: given a set of L + 1 real symmetric matrices A_0, A_1, ..., A_L of sizes n × n, and a set of real numbers λ*_1 ≤ λ*_2 ≤ ... ≤ λ*_m (m ≤ n), obtain d ∈ R^L and a permutation σ = (σ_1, ..., σ_m), with 1 ≤ σ_1 < ... < σ_m ≤ n, which minimizes the function F(d, σ) = (1/2) Σ_{i=1}^{m} (λ_{σ_i}(d) − λ*_i)², with λ_i(d), i = 1, 2, ..., n, the eigenvalues of the matrix A(d) = A_0 + Σ_{i=1}^{L} d_i A_i.
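The objective function F can be evaluated directly with NumPy's symmetric eigensolver. In this sketch the permutation σ is fixed and given with 0-based indices; it is an illustration of the objective, not the lift-and-project iteration itself.

```python
import numpy as np

def F(d, sigma, A, lam_star):
    """F(d, sigma) = 1/2 * sum_i (lambda_{sigma_i}(d) - lambda*_i)^2,
    where lambda_1(d) <= ... <= lambda_n(d) are the eigenvalues of
    A(d) = A[0] + sum_i d[i] * A[i + 1]."""
    Ad = A[0] + sum(di * Ai for di, Ai in zip(d, A[1:]))
    lam = np.linalg.eigvalsh(Ad)  # eigenvalues in ascending order
    return 0.5 * sum((lam[s] - l) ** 2 for s, l in zip(sigma, lam_star))
```

Each evaluation requires a full symmetric eigendecomposition, which is where the O(n^3) per-iteration work, and hence the O(n^4) overall cost, comes from.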
The lift-and-project method used to solve the Least Square problem has been that described in [23]. We analyze the different parts (TRACE, ADK, EIGEN, MATEIG, MATMAT and ZKA0A) to identify the lower level routines used in the solution of the problem, and the cost of the different parts of the algorithm. The theoretical model obtained is:
iter \left( \frac{22}{3} k_{syev} + 2 k_{3,gemm} + k_{3,diaggemm} \right) n^3 + iter \left( 2 k_{1,dot} + k_{1,scal} + k_{1,axpy} \right) n^2 L + 2 k_{1,dot} n^2 L^2 + k_{sum} n L^2 \qquad (5)

where iter represents the number of iterations needed to reach convergence.

In the formula there appear five parameters of routines from the lowest level of the hierarchy (BLAS), and one from the middle level (LAPACK). Three of the BLAS parameters correspond to operations of BLAS 1, and two are from BLAS 3, but of the same operation (gemm) with two different types of matrices.
Table 6 shows the time obtained for a 250 × 250 matrix and with L = 25. The total time and the time in each part of the algorithm appear in the table. The combinations of libraries used are: LAPACK and BLAS installed in the system and supposedly optimized for the machine (LaIn+BlIn), reference LAPACK and a freely available BLAS for Pentium III (LaRe+BIII), reference LAPACK and a freely available BLAS for Pentium II (LaRe+BlII), reference LAPACK and the installed BLAS (LaRe+BlIn), LAPACK and BLAS installed for the use of threads (LaInTh+BlInTh), reference LAPACK and a freely available BLAS for Pentium II using threads (LaRe+BlIITh), and reference LAPACK and the installed BLAS which uses threads (LaRe+BlInTh). The lowest execution time when not using threads (lnt.), the lowest when using threads (lwt.), and the overall lowest execution time (low.) are shown for each part of the algorithm and for the total time. We can see the lowest time represents an important reduction with respect to the execution time obtained with the best combination (LaRe+BIII), which is not the one installed in the system. The versions that use threads are not better than those that do not.
Table 6: Comparison of the execution time (in seconds) in the different parts of the algorithm and with different basic libraries. In dual Pentium III. n = m = 250, L = 25.

libraries       TRACE  ADK    EIGEN   MATEIG  MATMAT  ZKA0A  TOTAL
LaIn+BlIn       1.69   12.86  165.81  0.94    98.79   14.22  294.32
LaRe+BIII       1.16   14.87  210.85  0.83    26.70   10.46  264.89
LaRe+BlII       1.16   15.65  255.20  0.86    10.52   10.44  293.85
LaRe+BlIn       1.69   16.41  336.49  1.21    123.73  18.03  497.59
lnt.            1.16   12.86  165.81  0.83    10.52   10.44  201.64
LaInTh+BlInTh   1.10   13.92  266.63  0.66    14.13   12.34  308.80
LaRe+BlIITh     1.16   15.68  254.34  0.79    6.66    9.99   288.66
LaRe+BlInTh     1.10   13.71  249.59  0.62    13.74   11.90  290.68
lwt.            1.10   13.71  249.59  0.62    6.66    9.99   281.70
low.            1.10   12.86  165.81  0.62    6.66    9.99   197.06
If the information for matrix sizes 150 and 250 is stored in the LIFs (and for matrix size 100 the information associated with 150 is used, and for sizes 200 and 300 that associated with 250), then the execution time obtained with the model (mod.) can be compared with the execution time of the best library (lib.), with the time using the best library in each part of the algorithm (low.), and with the mean of the execution times of all the libraries considered (mea.). The mean value represents the execution time a non-expert user could obtain when solving the problem using some reasonable values of the parameters. The comparison is shown in table 7. The model gives a good approximation to the lowest achievable execution time, and the time with the model is clearly better than that with the best library, and much better than the mean time.
4 Conclusions and Future Work
The architecture of a polylibrary has been described, and the advantage of using polylibraries has been shown with experiments with routines of different levels in the libraries hierarchy, and in different systems. The method can be applied to
Table 7: Comparison of the execution time using the model (mod.) with the mean execution time (mea.), the time with the best library (lib.) and with the best library in each part of the algorithm (low.). In dual Pentium III. Square matrices and L = n/10.

size:  100    200     300
mea.   3.57   91.30   601.95
lib.   3.06   70.34   417.10
mod.   3.10   68.75   408.99
low.   3.02   66.54   406.11
sequential and parallel algorithms, and it can be combined with other methods of computation speed-up.
The LIF of each library will contain the cost of an operation for each one of the routines in the library. These costs may be different for different data sizes or access schemes. The main issue for the design of polylibraries is the decision of the format of the LIFs, which must be taken by the linear algebra community. This would allow the development of a hierarchy of automatic tuning libraries, with associated LIFs.
The proposed method could be applied to help in the development of efficient parallel libraries in other fields. At the moment we are researching the application to Divide and Conquer and Dynamic Programming algorithms.
References
[1] R. Whaley, A. Petitet and J. Dongarra, Automated Empirical Optimization of Software and the ATLAS Project, Parallel Computing 27 (2001) 3–25.
[2] T. Katagiri, A Study on Large Scale Eigensolvers for Distributed Memory Parallel Machines, Ph.D. Thesis, University of Tokyo, 2001.
[3] J. Cuenca, D. Gimenez and J. Gonzalez, Towards the Design of an Automatically Tuned Linear Algebra Library, in 10th EUROMICRO Workshop on Parallel, Distributed and Networked Processing PDP 2002 (IEEE, 2002) 201–208.
[4] A. Kalinov and A. Lastovetsky, Heterogeneous Distribution of Computations While Solving Linear Algebra Problems on Networks of Heterogeneous Computers, in High Performance Computing and Networking, eds. P. Sloot, M. Bubak, A. Hoekstra and B. Hetzberger, Lecture Notes in Computer Science, no. 1593 (Springer, 1999) 191–200.
[5] A. Petitet, S. Blackford, J. Dongarra, B. Ellis, G. Fagg, K. Roche and S. Vadhiyar, Numerical Libraries and the Grid, International Journal of High Performance Applications and Supercomputing 15 (2001) 359–374.
15
[6] Z. Chen, J. Dongarra, P. Luszczek and K. Roche, Self Adapting Softwarefor Numerical Linear Algebra and LAPACK for Clusters, Submitted toParallel Computing.
[7] J. Cuenca, D. Gimenez, J. Gonzalez, J. Dongarra and K. Roche, Auto-matic Optimisation of Parallel Linear Algebra Routines in Systems withVariable Load, in 11th EUROMICRO Workshop on Parallel, Distributedand Networked Processing PDP 2003, (IEEE, 2003) 409–416.
[8] E. A. Brewer, Portable High-Performance Supercomputing: High-LevelPlatform-Dependent Optimization, Ph. D. Thesis, Massachusetts Insti-tute of Technology, 1994.
[9] A. Skjellum and P. V. Bangalore, Driving Issues in Scalable Libraries:Polyalgorithms, Data Distribution Independence, Redistribution, LocalStorage Schemes, in 7th SIAM Conference on Parallel Processing for Sci-entific Computing, (SIAM, 1995) 734–737.
[10] J. J. Dongarra, J. Du Croz, I. S. Duff and S.Hammarling, A set of Level3 Basic Linear Algebra Subprograms, ACM Trans. Math. Soft. 16 (1990)1–17.
[11] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J.Du Croz, A. Greenbaum, S. Hammarling, A. McKenney and D. Sorensen,LAPACK Users’ Guide (SIAM, 1999).
[12] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon,J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walkerand R. C. Whaley, ScaLAPACK Users’ Guide, (SIAM, 1997).
[13] Greg Henry web page, www.cs.utk.edu/~ghenry.
[14] Kazushige Goto web page, www.cs.utexas.edu/users/flame/goto.
[15] R. A. van de Geijn, Using PLAPACK: Parallel Linear Algebra Package,(The MIT Press, 1997).
[16] Message Passing Interface Forum web page, www.mpi-forum.org.
[17] PVM web page, www.epm.ornl.org.gov/pvm/pvm home.html.
[18] D. Gimenez and G. Carrillo, Installation routines for linear algebra li-braries on LANs, in 4th International Meeting VECPAR2000, (Faculdadede Engenharia da Universidade do Porto, 2000) 393–398.
[19] S. S. Vadhiyar, G. E. Fagg and J. Dongarra, Automatically Tuned Collec-tive Communications, in Proceedings of the SC’2000 Conference, Dallas,Tx, Nov. 2000.
16
[20] J. Cuenca, D. Gimenez and J. Gonzalez, Modelling the Behaviour ofLinear Algebra Algorithms with Message Passing, in 9th EUROMICROWorkshop on Parallel, Distributed and Networked Processing PDP 2001,(IEEE, 2001) 282–289.
[21] T. H. Cormen, C. E. Leiserson and R. L. Rivest, Introduction to Algo-rithms, (The MIT Press, 1990).
[22] G. H. Golub and C. F. Van Loan, Matrix Computations, (The JohnsHopkins University Press, 1989).
[23] P. Alberti, Solucion Paralela de Mınimos Cuadrados para el Problema In-verso Aditivo de Valores Propios, Master’s Thesis, Universidad Politecnicade Valencia, 2002.
[24] H. Park and L. Elden, Schur-type methods for solving least squares prob-lems with Toeplitz structure, SIAM J. Sci. Comput. 22 (2000) 406–430.
[25] P. Alonso, J. M. Badıa and A. M. Vidal, A Parallel Algorithm for Solvingthe Toeplitz Least Squares Problem, in Vector and Parallel Processing-VECPAR 2000 eds. J. M. L. M. Palma, J. Dongarra V. Hernandez, LectureNotes in Computer Science, no. 1981, (Springer, 2001) 316–329.
[26] A. Alberti, A. Alonso, A. Vidal, J. Cuenca, L. P. Garcıa and D. Gimenez,Designing polylibraries to speed up linear algebra computations, TR LSI1-2003, Departamento de Lenguajes y Sistemas Informaticos, Universidadde Murcia, 2003. dis.um.es/~domingo/investigacion.html.
[27] K. Dackland and B. Kagstrom, A Hierarchical Approach for PerformanceAnalysis of ScaLAPACK-based Routines Using the Distributed Linear Al-gebra Machine, in Applied Parallel Computing: Computations in Physics,Chemistry and Engineering Science eds. Dongarra et. al. (Springer-Verlag,1996) 187–195.
[28] J. Cuenca, L. P. Garcıa, D. Gimenez, J. Gonzalez and A. Vidal, Empir-ical Modelling of Parallel Linear Algebra Routines, Submited to PPAM2003, Fifth International Conference on Parallel Processing and AppliedMathematics.
[29] Yunquan Zhang and Ying Chen, Block size selection of parallel LU andQR on PVP-based and RISC-based supercomputers, R & D Center forParallel Software, Institute of Software, Beijing, China (2003) 100–117.
[30] ATLAS web page at netlib, www.netlib.org/atlas/index.html.
17
A Parallel Implementation of Multi-domain High-order Navier-Stokes Equations Using MPI

Hui Wang 1, Yi Pan 2, Minyi Guo 1, Anu G. Bourgeois 2, Da-Ming Wei 1

1 School of Computer Science and Engineering, University of Aizu, Aizu-Wakamatsu, Fukushima 965-8580, Japan
hwang,minyi,[email protected]

2 Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Abstract
In this paper, Message Passing Interface (MPI) techniques are used to implement the high-order full 3-D Navier-Stokes equations in multi-domain applications. A two-domain interface with a five-point overlap used previously is expanded to a multi-domain computation. There are normally two approaches for this expansion. One is to break the domain into two parts through domain decomposition (say, with one overlap), then use MPI directives to further break each domain into n parts. The other is to break the domain into 2n parts with 2n − 1 overlaps. In our present effort, the latter approach is used, and finite-size overlaps are employed to exchange data between adjacent multi-dimensional sub-domains. It is general enough to parallelize the high-order full 3-D Navier-Stokes equations into multi-domain applications without adding much complexity. Results with high-order boundary treatments show consistency between multi-domain calculations and single-domain results. These experiments provide new ways of parallelizing the 3-D Navier-Stokes equations efficiently.
1 Introduction
Mathematicians and physicists believe that the phenomena of the waves following boats and the turbulent air currents following aircraft can be explained through an understanding of solutions to the Euler equations or Navier-Stokes equations. In fact, the Euler equations are simplified Navier-Stokes equations. The Navier-Stokes equations describe the motion of a fluid in two or three dimensions. These equations must be solved for an unknown velocity vector (say, u(x, t)) and pressure field (say, p(x, t)) at position x and time t.
In nonlinear fluid dynamics, the full Navier-Stokes equations are written in strong conservation form for a general time-dependent curvilinear coordinate transformation from (x, y, z) to (ξ, η, ζ) [1, 2]:

$$\frac{\partial}{\partial t}\!\left(\frac{U}{J}\right) + \frac{\partial F}{\partial \xi} + \frac{\partial G}{\partial \eta} + \frac{\partial H}{\partial \zeta} = \frac{1}{Re}\left[\frac{\partial F_v}{\partial \xi} + \frac{\partial G_v}{\partial \eta} + \frac{\partial H_v}{\partial \zeta}\right] + \frac{S}{J} \qquad (1)$$
where F, G, H are the inviscid fluxes and F_v, G_v, H_v are the viscous fluxes. Here, U = {ρ, ρu, ρv, ρw, ρE_t} is the solution vector, and the transformation Jacobian J is a function of time for dynamic meshes. S represents the components of a given, externally applied force.
Table 1: The parameters, points, and orders for different schemes in common use.
Scheme       Points and Order     Γ      a      b
Compact 6    5-point, 6th-order   1/3    14/9   1/9
Compact 4    3-point, 4th-order   1/4    3/2    0
Explicit 4   3-point, 4th-order   0      4/3    -1/3
Explicit 2   1-point, 2nd-order   0      1      0
Curvilinear coordinates are defined as those with a diagonal metric, so that the general metric has $g_{\mu\nu} \equiv \delta_{\mu}^{\nu} (h_{\mu})^2$, where $\delta_{\mu}^{\nu}$ is the Kronecker delta.
Equation (1) is simply Newton's law f = ma for a fluid element subject to the external force source term S. Although the equations were originally written down in the 19th century, no one has yet been able to find their full solutions without help from numerical computation.
In order to discretize the above equations, a finite-difference scheme is introduced [2, 3, 4, 5] because of its relative ease of formal extension to higher-order accuracy. The spatial derivative of any scalar quantity φ, such as a metric, flux component, or flow variable, is obtained in the transformed plane by solving the tridiagonal system:
$$\Gamma \phi'_{i-1} + \phi'_i + \Gamma \phi'_{i+1} = b\,\frac{\phi_{i+2} - \phi_{i-2}}{4} + a\,\frac{\phi_{i+1} - \phi_{i-1}}{2} \qquad (2)$$
where Γ, a, and b determine the spatial properties of the algorithm. Table 1 lists the values of the parameters Γ, a, and b, together with the points and orders, for schemes in common use. The dispersion characteristics and truncation error of the above schemes can be found in Reference [2].
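As a concrete check of the interior formula in Eq. (2), the Python sketch below applies the two explicit schemes of Table 1 (Γ = 0, so no tridiagonal solve is needed) to sin(x) and confirms the expected convergence orders. The grid sizes are arbitrary choices for illustration.

```python
import math

# Interior derivative formula from Eq. (2) with Gamma = 0 (explicit
# schemes), on a uniform grid of spacing h:
#   phi'_i = a (phi_{i+1} - phi_{i-1}) / (2h) + b (phi_{i+2} - phi_{i-2}) / (4h)
# Parameters (a, b) are taken from Table 1.

def explicit_derivative(phi, h, a, b, i):
    return (a * (phi[i + 1] - phi[i - 1]) / (2 * h)
            + b * (phi[i + 2] - phi[i - 2]) / (4 * h))

def max_interior_error(n, a, b):
    """Max error differentiating sin(x) on [0, 2*pi] with n points."""
    h = 2 * math.pi / (n - 1)
    x = [j * h for j in range(n)]
    phi = [math.sin(xj) for xj in x]
    return max(abs(explicit_derivative(phi, h, a, b, i) - math.cos(x[i]))
               for i in range(2, n - 2))

# Explicit 2 (a=1, b=0) vs Explicit 4 (a=4/3, b=-1/3): doubling the
# resolution should cut the error by roughly 4x and 16x respectively.
e2_coarse, e2_fine = max_interior_error(33, 1.0, 0.0), max_interior_error(65, 1.0, 0.0)
e4_coarse, e4_fine = max_interior_error(33, 4 / 3, -1 / 3), max_interior_error(65, 4 / 3, -1 / 3)
print(e2_coarse / e2_fine)   # ~4  (2nd order)
print(e4_coarse / e4_fine)   # ~16 (4th order)
```

The compact schemes (Γ ≠ 0) follow the same right-hand side but couple neighbouring derivative values, which is why they require the tridiagonal solve mentioned above.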
A high-order implicit filtering technique is introduced in order to avoid instabilities while maintaining the high accuracy of the spatial compact discretization. The filtered values $\hat\phi$ satisfy
$$\alpha_f \hat\phi_{i-1} + \hat\phi_i + \alpha_f \hat\phi_{i+1} = \sum_{n=0}^{N} \frac{a_n}{2}\left(\phi_{i+n} + \phi_{i-n}\right) \qquad (3)$$
where the parameter α_f satisfies −0.5 < α_f < 0.5. Pade-series and Fourier-series analyses are employed to derive the coefficients a_0, a_1, ..., a_N. Using curvilinear coordinates, filter formulas of up to 10th order have been developed to solve the Navier-Stokes equations.
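To illustrate Eq. (3), here is a Python sketch of the lowest-order case, assuming α_f = 0 and N = 1 with the standard second-order coefficients a_0 = a_1 = 1/2 (an assumption noted here; the higher-order coefficients come from the Pade/Fourier analysis). In this limit the filter reduces to an explicit three-point average that leaves a constant field untouched and annihilates the odd-even grid mode, which is the kind of instability the filter targets.

```python
# Second-order filter from Eq. (3), assuming alpha_f = 0 and N = 1 with
# a0 = a1 = 1/2. The implicit filter then becomes an explicit average:
#   phi_f_i = (phi_{i-1} + 2*phi_i + phi_{i+1}) / 4

def filter2(phi):
    n = len(phi)
    return [(phi[(i - 1) % n] + 2 * phi[i] + phi[(i + 1) % n]) / 4
            for i in range(n)]  # periodic end conditions

smooth = [1.0] * 8                          # a constant field
nyquist = [(-1.0) ** i for i in range(8)]   # odd-even (grid-to-grid) mode

print(filter2(smooth))   # preserved exactly: [1.0, 1.0, ..., 1.0]
print(filter2(nyquist))  # annihilated: [0.0, 0.0, ..., 0.0]
```

With α_f ≠ 0 the left-hand side couples neighbouring filtered values and a tridiagonal solve is required, exactly as for the compact derivative in Eq. (2).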
Reference [2] provides a numerical method for solving the Navier-Stokes equations. To yield highly accurate solutions, Visbal and Gaitonde use higher-order finite-difference formulas with emphasis on Pade-type operators, which are usually superior to Taylor expansions when the functions contain poles, together with a family of single-parameter filters. By developing higher-order one-sided filters, References [2, 5] obtain enhanced accuracy without any impact on stability. With non-trivial boundary conditions, highly curvilinear three-dimensional meshes, and non-linear governing equations, the formulas are suitable for practical situations. Higher-order computation, however, may increase the calculation time.
To satisfy the need for highly accurate numerical schemes without increasing the calculation time, parallel computing techniques constitute an important component of modern computational strategy. Using parallel programming methods on parallel computers, we gain access to greater memory and CPU resources not available on serial computers, and are able to solve large problems more quickly.
A two-domain interface was introduced in Reference [5] to solve the Navier-Stokes equations. With a two-domain interface, each domain first runs with its own input data. Each domain writes the data needed by the other domain to a temporary file on the hard disk, and also checks and reads the temporary file to retrieve the data output by the other domain. This manual treatment, however, has some shortcomings: it requires much more time because it must write the data to the hard disk. Because of this low-speed I/O and the limitation of memory, this method is clearly not convenient for multi-domain calculations.
In our work, parallelizing techniques using the Message Passing Interface (MPI) [6, 8, 7, 9] are used to implement the high-order full three-dimensional Navier-Stokes equations in multi-domain applications on a multiprocessor parallel system. The two-domain interface used in [5] is expanded to multi-domain computations using MPI techniques. There are two approaches for this expansion. One approach is to break the domain into two parts through domain decomposition (say, with one overlap), then use MPI directives to further break each domain into n parts. The other approach is to break it into 2n domains with 2n − 1 overlaps. The latter approach is used in this paper, and we employ finite-sized overlaps to exchange data between adjacent multi-dimensional sub-domains. Without adding much complexity, it is general enough to treat curvilinear mesh interfaces. Results show that multi-domain calculations reproduce the single-domain results with high-order boundary treatments.
2 Parallelization with MPI
Historically, there have been two main approaches to writing parallel programs. One is to use a directives-based data-parallel language; the other is explicit message passing via library calls from standard programming languages. In the latter approach, exemplified by the Message Passing Interface (MPI), it is left to the programmer to explicitly divide data and work across the processors and then manage the communication among them. This approach is very flexible.
In our program, there are two different sets of input files, one per domain. One data set is xdvc2d1.dat and xdvc2d1.in, and the other is xdvc2d2.dat and xdvc2d2.in. xdvc2d1.dat and xdvc2d2.dat are parameter files, while xdvc2d1.in and xdvc2d2.in are initialized as input metrics files. The original input subroutine DATIN is copied into two input subroutines, DATIN1 and DATIN2. All processes share data from one of the input files:
      IF ( MYID .LT. NPES / 2 ) THEN
        CALL DATIN1
      ELSE IF ( MYID .GE. NPES / 2 ) THEN
        CALL DATIN2
      ENDIF
For convenience, the size of the input metrics is the same for all processors sharing the same input file.
      IF ( MYID .LT. NPES / 2 ) THEN
        IMPI = ILDMY / ( NPES / 2 ) * MYID
      ELSE
        IMPI = ILDMY / ( NPES - NPES / 2 ) * ( MYID - NPES / 2 )
      ENDIF
Figure 1: Schematic of mesh overlap with 5 points in a two-domain calculation [5]. Mesh 1 (columns IL-5 through IL) and Mesh 2 (columns 1 through 6) share a 5-point overlap; arrows show information transfer between the meshes.
ILDMY is the size value read from the input file. In the following code, ILDMY is reset to the local size for each processor.
      IF ( MYID .EQ. NPES / 2 - 1 ) THEN
        ILDMY = ILDMY - ILDMY / ( NPES / 2 ) * ( NPES / 2 - 1 )
      ELSE IF ( MYID .EQ. NPES - 1 ) THEN
        ILDMY = ILDMY - ILDMY / ( NPES - NPES / 2 ) * ( NPES - NPES / 2 - 1 )
      ELSE IF ( MYID .LT. NPES / 2 ) THEN
        ILDMY = ILDMY / ( NPES / 2 )
      ELSE
        ILDMY = ILDMY / ( NPES - NPES / 2 )
      ENDIF
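The intent of the two Fortran fragments above (the offset IMPI and the local size ILDMY) can be mirrored in a short Python sketch; the sanity check below confirms that, for an example configuration, the per-rank slices exactly tile each domain. The NPES and ILDMY values are illustrative, not from the actual runs.

```python
# Pure-Python mirror of the Fortran partitioning logic above: NPES ranks
# split into two groups (one per input domain); within a group of p
# ranks each rank gets ildmy // p grid lines, and the last rank of the
# group absorbs the remainder. 'offset' plays the role of IMPI.

def partition(myid, npes, ildmy):
    half = npes // 2
    if myid < half:                      # first group (reads DATIN1)
        p, local = half, myid
    else:                                # second group (reads DATIN2)
        p, local = npes - half, myid - half
    offset = (ildmy // p) * local
    size = ildmy - (ildmy // p) * (p - 1) if local == p - 1 else ildmy // p
    return offset, size

# Sanity check: within each group the slices are contiguous and
# tile [0, ildmy) exactly, with no gaps or overlaps.
npes, ildmy = 5, 68   # illustrative values
for group in (range(npes // 2), range(npes // 2, npes)):
    slices = [partition(r, npes, ildmy) for r in group]
    assert slices[0][0] == 0
    assert sum(s for _, s in slices) == ildmy
    for (o1, s1), (o2, _) in zip(slices, slices[1:]):
        assert o1 + s1 == o2
print("slices tile both domains")
```

With npes = 5 the second group has one more rank than the first, and its last rank picks up the two leftover lines, matching the `MYID .EQ. NPES - 1` branch of the Fortran.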
All of the above modifications are made in the main program. The important part, however, is in subroutine BNDRY, where we call subroutine BNDRYMPI instead of subroutine BNDRYXD in order to process multi-domain computations using MPI techniques.
In BNDRYMPI, the communication between different processes is completed by the message sending and receiving functions.
      IF ( MYID .GT. 0 ) THEN
        CALL MPI_SEND   ! sends messages to processor (MYID - 1)
        CALL MPI_IRECV  ! receives messages from processor (MYID - 1)
        CALL MPI_WAIT ( REQUEST, STATUS, IERR )
      ENDIF

      IF ( MYID .LT. NPES - 1 ) THEN
        CALL MPI_SEND   ! sends messages to processor (MYID + 1)
        CALL MPI_IRECV  ! receives messages from processor (MYID + 1)
        CALL MPI_WAIT ( REQUEST, STATUS, IERR )
      ENDIF
The above calls allocate a communication request object and associate it with the request handle (the argument REQUEST). The request can be used later to query the status of the communication or to wait for its completion. Here nonblocking message communication is used. A nonblocking send call indicates that the system may start copying data out of the send buffer; the sender should not access any part of the send buffer after a nonblocking send operation is called, until the send completes. A nonblocking receive call indicates that the system may start writing data into the receive buffer; the receiver should not access any part of the receive buffer after a nonblocking receive operation is called, until the receive completes. We can see the configuration of this communication in the next section.
3 Multi-domain calculation
In this section we introduce the configuration for multi-domain computations. The communication between any two adjacent domains is conducted through finite-size overlaps. There is an unsteady flow generated by a convecting vortex in the left-side domain(s) and a uniform flow in the right-side domain(s). Some parameters are common to all runs: the non-dimensional vortex strength C/(U∞R) = 0.02, the domain sizes −6 < X = x/R < 6 and −6 < Y = y/R < 6, 136 × 64 uniformly spaced points with a spacing interval ΔX = ΔY = 0.41, and a time interval Δt = 0.005, where R is the nominal vortex radius.
The overlap method in Reference [5] is a successful two-domain calculation method. Therefore, we expand this method when developing the MPI Fortran version of the numerical solution of the Navier-Stokes equations.
To illustrate data transfer between the domains, details of a five-point vertical overlap are shown in Figure 1. The vortex is initially located at the center of Mesh 1 (1st domain, solid lines). When the computation begins, the vortex moves towards Mesh 2 (2nd domain, dashed lines) with increasing time; it does not stop, but passes through the interface and then convects into Mesh 2. The meshes communicate through an overlap region. In Figure 1, the overlap points are shown slightly staggered for clarity, although they are collocated. At every time step, data is exchanged between domains at the end of each sub-iteration of the implicit scheme, as well as after each application of the filter. Each vertical line is denoted by its i-index. The data in columns IL − 4 and IL − 3 of the 1st domain are sent to columns 1 and 2 of the 2nd domain, and the data in columns 4 and 5 of the 2nd domain are sent to columns IL − 1 and IL of the 1st domain.
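The column traffic just described can be simulated serially. In the following Python sketch, plain dictionaries stand in for the MPI buffers (an assumption for illustration), and the assertions confirm that after one exchange all five collocated overlap columns agree, while the middle column never has to be sent. The field values are arbitrary stand-ins.

```python
# Serial simulation of the 5-point overlap exchange: mesh 1 holds
# global columns 1..IL, mesh 2 holds global columns IL-4 .. 2*IL-5,
# so the two meshes share five collocated columns.
IL = 10
mesh1 = {i: float(i) for i in range(1, IL + 1)}
mesh2 = {i: float(i + IL - 5) for i in range(1, IL + 1)}

# Pretend the halo columns are stale (e.g. from a previous sub-iteration).
mesh1[IL - 1] = mesh1[IL] = 0.0
mesh2[1] = mesh2[2] = 0.0

def exchange(m1, m2):
    """The 5-point overlap exchange of Figure 1."""
    m2[1], m2[2] = m1[IL - 4], m1[IL - 3]   # mesh 1 -> mesh 2
    m1[IL - 1], m1[IL] = m2[4], m2[5]       # mesh 2 -> mesh 1

exchange(mesh1, mesh2)
# Every collocated overlap pair now agrees. The middle column (IL-2 in
# mesh 1, 3 in mesh 2) is the non-communicating overlap point: each
# mesh computes it locally, so it never needs to be transferred.
for i in range(5):
    assert mesh1[IL - 4 + i] == mesh2[1 + i]
print("5-point overlap consistent")
```

In the real code each "column" is a full 2-D slab of the 3-D field, and the two assignments in `exchange` become the MPI_SEND/MPI_IRECV pairs shown earlier.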
Figure 2: The sketch map of sending and receiving messages among different processes for single-domain, two-domain, and multi-domain computations.
Figure 3: The swirl velocity as a function of x, obtained in a two-domain calculation with periodic end conditions, compared at three instants: T = 0 (upper left), 6 (upper right), and 12 (bottom). Different colors are used to separate the velocities of the set IL = 34 + 35, JL = 34 on different processors.
Multi-domain calculations are performed using exactly the same principle as single-domain computations. The only difference is that data are exchanged between adjacent domains at the end of each scheme and after each filtering.
Expanding the configuration from two domains to multiple domains, these transfers take place between each pair of adjacent domains (processors). The computation starts with the vortex located at the center of Mesh i. With increasing time, the vortex convects into Mesh i + 1 after passing through the interface. The meshes communicate through an overlap region. Each mesh has IL vertical lines, denoted by their i-index, and JL horizontal lines. The values at points 1 and 2 of Mesh 2 are equal to the corresponding values at points IL − 4 and IL − 3 of Mesh 1. Similarly, values at points 4 and 5 of Mesh 2 are sent to points IL − 1 and IL of Mesh 1. The exchange of information is shown with arrows at each point in the schematic. Note that although the points on line IL − 2 (Mesh 1) and line 3 (Mesh 2) also overlap, they do not need to communicate directly. This occurs only when the number of overlap points is odd (say, 5 here), and such a point is called the non-communicating overlap (NCO) point. Figure 2 shows the sketch of sending and receiving messages among the different processes in our program. A vertical interface is introduced into the curvilinear mesh as shown schematically.
Figure 4: Running time and speedup for the set IL = 34 + 35, JL = 34, the set IL = 136 × 2, JL = 64, and the set IL = 68 × 2, JL = 64, versus the number of processors.
The number of overlap points per line can be larger or smaller, but there must be at least two points. Reference [5] investigated odd numbers of overlap points, specifically 3, 5, and 7 points, and the effect of overlap size. Increasing the overlap size also increases the instabilities, and the vortex is diffused and distorted. In a 3-point overlap, each domain receives data only at an end point, which is not filtered. With larger overlaps, every domain needs to exchange several near-boundary points, which must be filtered owing to the character of the Pade-type formulas. However, filtering the received data can actually introduce instabilities at domain interfaces. On the other hand, results show that accuracy may increase with a larger overlapping region: the 7-point overlap calculation, as opposed to the 3-point overlap calculation, can reproduce the single-domain calculation. As a balance, schemes with a 5-point horizontal overlap perform well at reproducing the results of the single-domain calculation while keeping the vortex in its original shape.
Figure 5: Swirl velocities of the multi-domain calculation for the set with IL = 136 × 2 and JL = 64 at T = 12, shown as a function of the number of processors (2, 5, 6, and 10 CPUs).
When the specification of the boundary treatment is important, the precise scheme is denoted with 5 numbers, abcde, where a through e denote the order of accuracy of the formulas employed at points 1, 2, 3 · · · IL − 2, IL − 1, and IL, respectively. For example, 45654 consists of a sixth-order interior compact formula that degrades to 5th- and 4th-order accuracy near each boundary. Explicit schemes in this 5-number notation are prefixed with an E. The curvilinear mesh in this paper is generated with the parameters xmin = ymin = -6, xmax = 18, ymax = 6, Ax = 1, Ay = 4, nx = ny = 4, and φx = φy = 0. The 34443 scheme with a 5-point overlap is used to generate the results.
4 Performance
All of the experiments were performed on an SGI Origin 2000 distributed-shared-memory multiprocessor system with 24 R10000 processors. Each processor has a 200 MHz CPU with 4 MB of L2 cache and 2 GB of main memory. The system runs IRIX 6.4.
Beginning with a theoretical initial condition, the swirl velocity, $\sqrt{(u - u_\infty)^2 + v^2}$, obtained in the two-domain calculation with periodic end conditions is compared at three instants: T = 0 (upper-left panel), 6 (upper-right panel), and 12 (bottom panel). At time T = 0, the vortex center is located at X = 0 and, as time increases, it convects to the right at a non-dimensional velocity of unity. The two-domain results are clearly demonstrated in Figure 3, which shows the swirl velocity along Y = 0 at T = 12. The discrepancy in peak swirl velocity between the single- and two-domain computations is negligible. The vortex is well behaved as it moves through the interface, and the contour lines are continuous at the non-communicating overlap point. This shows that the two-domain results reproduce the single-domain values quite well, and indicates that the higher-order interface treatments are stable and recover the single-domain solution.
The first serious issue we confronted was the correctness of our computation. Comparing the two-domain computation with the manual treatment of Reference [5], the results produced by the MPI parallel code were found to be in excellent agreement with those in Reference [5]; there is no difference between them.
Speedup figures describe the relationship between cluster size and execution time for a parallel program. If a program has run before (perhaps on a range of cluster sizes), we can record past execution times and use a speedup model to summarize the historical data and estimate future execution times. Here, the speedup for the IL = 136 × 2, JL = 64 set versus the number of processors is shown in Figure 4 for the MPI-based parallel calculations. From the figure, we can see that the speedup is close to 8 for 10 CPUs. The reason is that, in the five-point overlap treatment, the computation on each processor is almost independent, while the cost of communication between the processors can be disregarded. The speedup is thus close to a linear function of the number of processors, which shows that our parallel MPI code partitions the work well.
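The speedup model mentioned above can be sketched as follows. Since the two-input-file code cannot run on one CPU, the sketch takes the 2-CPU run as the baseline and treats p · T(p) at p = 2 as the implied serial time (an assumption of this sketch); the timings themselves are hypothetical, chosen only to resemble near-linear scaling like that reported here.

```python
# Speedup relative to an implied serial time. Because the code cannot
# run on a single CPU, we assume T(1) ~ base_p * T(base_p) with base_p
# the smallest processor count that was actually run.

def speedup(timings, base_p=2):
    """timings: {processors: wall time in seconds} -> {processors: speedup}."""
    t1 = base_p * timings[base_p]        # implied serial time (assumption)
    return {p: round(t1 / timings[p], 1) for p in sorted(timings)}

timings = {2: 400.0, 5: 165.0, 10: 100.0}   # hypothetical seconds
print(speedup(timings))   # {2: 2.0, 5: 4.8, 10: 8.0}
```

A near-linear curve (here speedup 8 on 10 CPUs) is consistent with computation dominating communication, as argued above for the five-point overlap treatment.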
The single-domain computation with the same conditions is also included in Figure 4. Since two different input files exist, we cannot use the MPI-based parallel code with only 1 CPU (processor); this restriction could be removed later if a single input file were used. On the other hand, the maximum number of processors is limited by the number of horizontal points, IL. Suppose we have a set with IL = 34: more than 6 processors may produce error messages, because each domain must contain at least about 5 points to exchange with its adjacent domain.
Results of multi-domain calculations for the set with IL = 136 × 2, JL = 64 are shown in Figure 5. In the figure, the swirl velocities at T = 12 are shown as a function of the number of processors. With different numbers of processors, the shapes of the swirl velocities are very close. This shows that our multi-domain calculation on a multiprocessor system is consistent with the single-domain results.
5 Conclusion
In this paper, the multi-domain high-order full 3-D Navier-Stokes equations were studied. MPI was used to treat the multi-domain computation, which is expanded from the two-domain interface used previously. A single domain is broken into up to 2n domains with 2n − 1 overlaps, and five-point finite-sized overlaps are employed to exchange data between adjacent multi-dimensional sub-domains. The approach is general enough to parallelize the high-order full 3-D Navier-Stokes equations into multi-domain applications without adding much complexity. Parallel execution results with high-order boundary treatments show strong agreement between multi-domain calculations and the single-domain results. These experiments provide new ways of parallelizing the 3-D Navier-Stokes equations efficiently.
References
[1] D. A. Anderson, J. C. Tannehill, and R. H. Pletcher, Computational Fluid Mechanics and Heat Transfer, McGraw-Hill Book Company, 1984.

[2] M. R. Visbal and D. V. Gaitonde, High-Order Accurate Methods for Unsteady Vortical Flows on Curvilinear Meshes, AIAA Paper 98-0131, January 1998.

[3] S. K. Lele, Compact Finite Difference Schemes with Spectral-like Resolution, Journal of Computational Physics, 103:16-42, 1992.

[4] D. V. Gaitonde, J. S. Shang, and J. L. Young, Practical Aspects of High-Order Accurate Finite-Volume Schemes for Electromagnetics, AIAA Paper 97-0363, Jan. 1997.

[5] D. V. Gaitonde and M. R. Visbal, Further Development of a Navier-Stokes Solution Procedure Based on Higher-Order Formulas, AIAA Paper 99-0557, 37th AIAA Aerospace Sciences Meeting & Exhibit, Reno, NV, Jan. 1999.

[6] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, Massachusetts, 1994.

[7] P. Pacheco, Parallel Programming with MPI, Morgan Kaufmann, San Francisco, California, 1997.

[8] M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra, MPI: The Complete Reference, MIT Press, Cambridge, Massachusetts, 1996.

[9] B. Wilkinson and M. Allen, Parallel Programming, Prentice Hall, Upper Saddle River, New Jersey, 1999.
Developing an Object-oriented Parallel Iterative-Methods Library

Chakib Ouarraoui and David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
ochakib,[email protected]
Abstract
In this paper we describe our work developing an object-oriented parallel toolkit on OOMPI. We are developing a parallelized implementation of the Multi-view Tomography (MVT) Toolbox, an iterative solver for a range of tomography problems. The code is developed in object-oriented C++. The MVT Toolbox is presently used by researchers in the field of tomography to solve linear and non-linear forward modeling problems. This paper describes our experience parallelizing an object-oriented numerical library comprising a number of iterative algorithms. In this work, we present a parallel version of the BiCGSTAB algorithm and the block-ILU preconditioner. These two algorithms are implemented in C++ using OOMPI (Object-Oriented MPI) and run on a 32-node Beowulf cluster. We also demonstrate the importance of using threads to overlap communication and computation as an effective path to improved speedup.

Keywords: parallel computing, OOMPI, multi-view tomography, iterative methods.
1 Introduction
Developing object-oriented programs in a parallel programming environment is an emerging area of research. As we begin to utilize an object-oriented programming paradigm, we need to rethink how we develop libraries in order to maintain the encapsulation and polymorphism provided by the language. In this paper we develop parallel versions of methods associated with the IML++ library [3]. Our goal is an efficient implementation of the library that can run on a Beowulf cluster. The IML++ library is C++ code that contains a number of iterative algorithms, such as Biconjugate Gradient, Generalized Minimal Residual, and Quasi-Minimal Residual. In order to develop a parallel version of the IML++ library, each individual algorithm needs to be parallelized.
To develop a parallel implementation of an object-oriented library, we need middleware that works seamlessly with an object-oriented programming paradigm. OOMPI is an object-oriented message passing interface [7]. It provides a class library that encapsulates the functionality of MPI into a functional class hierarchy, providing a simple and flexible interface. We use OOMPI as we attempt to parallelize the MVT Toolbox on a Beowulf cluster.
This paper is organized as follows. In Section 2 we describe the Multi-view Tomography Toolbox. In Section 3 we discuss the design of the IML++ library. In Section 4 we go into the details of the parallel BiCGSTAB algorithm and the ILU preconditioner. In Section 5 we discuss the issue of scalability as a function of the amount of communication inherent in the code. Finally, in Section 6 we summarize the paper and discuss directions for future work.
2 MVT Toolbox
The MVT Toolbox provides researchers in the many tomography fields with a collection of generic tools for solving linear and non-linear forward modeling problems [6]. The toolbox is designed for flexibility in establishing parameters, providing the ability to tailor the tool to specific tomography problem domains. The library provides a numerical solution to partial differential equations of the form
$$\nabla \cdot \left(\sigma(\mathbf{r})\, \nabla \phi(\mathbf{r})\right) + k^2(\mathbf{r})\,\phi(\mathbf{r}) = s(\mathbf{r}), \qquad (1)$$
where r is a point in 3-space; σ(r) is the conductivity, which can be any real-valued function of space; k²(r) is the wave number, which can be any complex-valued function of space; φ(r) is the field for which we are solving; s(r) is the source function; and the boundary conditions can be Dirichlet, Neumann, or mixed.
The MVT Toolbox is designed as a numerical solver for problems of the general form defined by Equation 1. The Toolbox can solve this problem on a regular or semi-regular Cartesian grid. It has the ability to model arbitrary sources as well as receivers. It can also use a finite-difference approach for discretization of the equation defined above and the associated boundary conditions in 3-D space. In this case, the problem of solving the differential equation for the field is reduced to solving a large sparse system of linear equations of the general form Af = s, where the matrix A represents the finite-difference discretization of the differential operator as well as the boundary conditions. The vector f contains the sampled values of the field at the grid points, and the vector s holds the discretized source function. The IML++ package is used as the main solver for this system.
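A one-dimensional analogue makes the reduction to Af = s concrete. The Python sketch below (an illustration, not MVT code) discretizes φ″ + k²φ = s with σ = 1 and homogeneous Dirichlet boundaries, then solves the resulting tridiagonal system with the Thomas algorithm. A quadratic manufactured solution is reproduced exactly, because second-order differences are exact for quadratics.

```python
# 1-D sketch of the finite-difference reduction described above:
# sigma = 1 and Dirichlet boundaries turn phi'' + k^2 phi = s into a
# tridiagonal system A f = s, solved here with the Thomas algorithm.

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are unused."""
    n = len(diag)
    d, r = diag[:], rhs[:]
    for i in range(1, n):                 # forward elimination
        w = lower[i] / d[i - 1]
        d[i] -= w * upper[i - 1]
        r[i] -= w * r[i - 1]
    f = [0.0] * n
    f[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        f[i] = (r[i] - upper[i] * f[i + 1]) / d[i]
    return f

def solve_helmholtz_1d(n, k2, source, a=0.0, b=1.0):
    """Solve phi'' + k2*phi = s on (a, b) with phi(a) = phi(b) = 0."""
    h = (b - a) / (n + 1)
    x = [a + (i + 1) * h for i in range(n)]      # interior grid points
    diag = [-2.0 / h**2 + k2] * n
    off = [1.0 / h**2] * n
    rhs = [source(xi) for xi in x]
    return x, solve_tridiagonal(off, diag, off, rhs)

# Manufactured solution phi = x(1-x), so phi'' = -2, s = -2, k2 = 0.
x, f = solve_helmholtz_1d(31, 0.0, lambda xi: -2.0)
print(max(abs(fi - xi * (1 - xi)) for xi, fi in zip(x, f)))  # ~0
```

In 3-D the matrix is no longer tridiagonal but banded and sparse, which is exactly why the iterative Krylov methods of IML++ are used instead of a direct solve.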
3 IML++ Library
The IML++ package [3] is a C++ implementation of a set of iterative method routines. IML++ was developed as part of the Templates Project [2]. The package relies on the user to supply matrix, vector, and preconditioner classes that implement specific operations. The IML++ library includes a number of iterative methods that support the most commonly used sparse data formats. The library is highly templatized so that it can be used with different sparse matrix formats, together with a range of data formats. The library is described in [2], and contains the following iterative methods:
• Conjugate Gradient (CG)
• Generalized Minimal Residual (GMRES)
• Minimum Residual (MINRES)
• Quasi-Minimal Residual (QMR)
• Conjugate Gradient Squared (CGS)
• Biconjugate Gradient (BiCG)
• Biconjugate Gradient Stabilized (Bi-CGSTAB)
4 Parallel MVT
To develop a parallel implementation of the MVT Toolbox, we have utilized a profile-guided approach. The main advantage of this technique is that there is no need to delve into the details of the library (source code for the library may not be available). Profiling allows us to concentrate only on the hot portions of the code.
The first step in our work is to profile the serial code on a uni-processor machine. We then develop a parallel version of the hot code.
4.1 Profiling the MVT Toolbox
To profile the Toolbox, we used the technique described in [1, 4]. A similar technique has been used to profile I/O accesses [8]. Using this technique, the serial code is profiled on a single-processor machine. The GNU utility gprof was used to profile the library. The results show that most of the execution time is spent in the Iterative Method Library (IML++), so we chose to focus our parallelization work on the IML++ library.
4.2 Parallelization of Object-oriented Code
The IML++ library code computes solutions for different sources and frequencies. Therefore the easiest way to parallelize the toolbox is to parallelize at a coarse grain, computing each source on a node of the cluster. However, in general, the number of modeled sources is less than the number of processors, so the available resources would be underutilized. Another option is to follow a hybrid approach where each source is solved on a group of processors. While this approach obtains good results, it still requires parallelizing the IML++ library.
4.3 Parallel IML++
To begin to parallelize the IML++ library, we need to consider the various iterative algorithms included with the package. Each individual algorithm needs to be parallelized. In this section we explore a parallel implementation of the Biconjugate Gradient Stabilized algorithm (BiCGSTAB), as well as a parallel version of the Incomplete LU factorization code.
4.3.1 BiCGSTAB Algorithm
The BiCGSTAB algorithm is a Krylov subspace iterative method for solving a large non-symmetric linear system of the form Ax = b. The algorithm extracts approximate solutions from the Krylov subspace. BiCGSTAB is similar to the BiCG algorithm except that BiCGSTAB provides faster convergence. The pseudocode for the BiCGSTAB algorithm is provided in Figure 1.
4.3.2 Parallel BiCGSTAB Algorithm
Iterative algorithms can be challenging to parallelize. In the BiCGSTAB algorithm, there is a loop-carried dependence. If we attempt to parallelize this form of the code, a synchronization barrier is required after each iteration, so most of the time will be spent synchronizing on shared access to the data. There has been related research on minimizing the number of synchronizations. Following the approach described by Yang and Brent [9], the BiCGSTAB algorithm can be restructured so that there is only one global synchronization per iteration, by redefining and splitting the equations of the original algorithm. For example, vn can be written as
1:  r0 = b − A·x0
2:  ρ0 = α0 = ω0 = 1
3:  v0 = p0 = 0
4:  for n = 1, 2, 3, ... do
5:      ρn = r0ᵀ·rn−1
6:      βn = (ρn / ρn−1)(αn−1 / ωn−1)
7:      pn = rn−1 + βn(pn−1 − ωn−1·vn−1)
8:      vn = A·pn
9:      αn = ρn / (r0ᵀ·vn)
10:     sn = rn−1 − αn·vn
11:     tn = A·sn
12:     ωn = (tnᵀ·sn) / (tnᵀ·tn)
13:     xn = xn−1 + αn·pn + ωn·sn
14:     if xn is accurate enough then
15:         STOP
16:     end if
17:     rn = sn − ωn·tn
18: end for

Figure 1: The BiCGSTAB Algorithm.
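A minimal serial sketch of the pseudocode in Figure 1 might look as follows in C++ (dense storage, no preconditioning; the IML++ routine is templated and far more general):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}

// Unpreconditioned BiCGSTAB following the pseudocode of Figure 1.
// Step numbers in the comments refer to that figure.
Vec bicgstab(const Mat& A, const Vec& b, double tol = 1e-10, int maxit = 200) {
    size_t n = b.size();
    Vec x(n, 0.0), r = b;                    // step 1: r0 = b - A*x0, x0 = 0
    Vec r0 = r, p(n, 0.0), v(n, 0.0);        // step 3
    double rho = 1.0, alpha = 1.0, omega = 1.0;  // step 2
    for (int it = 0; it < maxit; ++it) {
        double rho_new = dot(r0, r);                       // step 5
        double beta = (rho_new / rho) * (alpha / omega);   // step 6
        for (size_t i = 0; i < n; ++i)                     // step 7
            p[i] = r[i] + beta * (p[i] - omega * v[i]);
        v = matvec(A, p);                                  // step 8
        alpha = rho_new / dot(r0, v);                      // step 9
        Vec s(n);
        for (size_t i = 0; i < n; ++i) s[i] = r[i] - alpha * v[i]; // step 10
        if (std::sqrt(dot(s, s)) < tol) {       // early exit safeguard
            for (size_t i = 0; i < n; ++i) x[i] += alpha * p[i];
            break;
        }
        Vec t = matvec(A, s);                              // step 11
        omega = dot(t, s) / dot(t, t);                     // step 12
        for (size_t i = 0; i < n; ++i)                     // step 13
            x[i] += alpha * p[i] + omega * s[i];
        for (size_t i = 0; i < n; ++i) r[i] = s[i] - omega * t[i]; // step 17
        if (std::sqrt(dot(r, r)) < tol) break;             // steps 14-16
        rho = rho_new;
    }
    return x;
}
```

Note the two inner products per half-iteration (steps 5, 9 and 12): in a distributed setting each one forces a global reduction, which is exactly the synchronization cost the restructured algorithm of the next section attacks.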
vn = A·pn = A(rn−1 + βn(pn−1 − ωn−1·vn−1)) = un−1 + βn·vn−1 − βn·ωn−1·qn−1
The same transformation can be applied to all vectors so that φn = r0ᵀ·sn and τn = r0ᵀ·vn. These code transformations make the task of parallelization more straightforward. A modified version of the BiCGSTAB algorithm is presented in Figure 2. In the modified BiCGSTAB algorithm, we can clearly see that the inner products in steps 13–18 do not involve loop-carried dependencies. The vector updates in lines 10 and 11 are independent, and the vector updates in lines 21 and 22 are also independent.
4.3.3 Implementation of Parallel BiCGSTAB
IML++ is a highly templatized library that provides the user with the ability to reuse the code for different problem classes and data formats. To maintain this polymorphism when moving to a parallel environment, some modifications needed to be made to the structure of the MVT Toolbox. The parallel implementation of the BiCGSTAB algorithm requires the addition of new methods to the vector and matrix classes. These methods basically set the range for the computations. Making these changes helps to ensure the efficiency and reusability of the classes. Other options are available that still maintain the object-oriented nature of the code and would not impact the main library. One would be to cast the input vector and matrix into a newly defined class that contains the needed methods; however, this would introduce additional memory overhead. Another would be to extract the data from the input vectors and matrices, but this would add processing overhead on each access to the data.
So the option we arrived at requires that the input vector and matrix classes provide new
1:  r0 = b − A·x0;  u0 = A·r0;  f0 = Aᵀ·r0;  q0 = v0 = z0 = 0
2:  σ−1 = π0 = φ0 = τ0 = 0;  σ0 = r0ᵀ·u0;  ρ0 = α0 = ω0 = 1
3:  for n = 1, 2, 3, ... do
4:      ρn = φn−1 − ωn−1·σn−2 + ωn−1·αn−1·πn−1
5:      δn = (ρn / ρn−1)·αn−1;  βn = δn / ωn−1
6:      τn = σn−1 + βn·τn−1 − δn·πn−1
7:      αn = ρn / τn
8:      vn = un−1 + βn·vn−1 − δn·qn−1
9:      qn = A·vn
10:     sn = rn−1 − αn·vn
11:     tn = un−1 − αn·qn
12:     zn = αn·rn−1 + βn·zn−1 − αn·δn·vn−1
13:     φn = r0ᵀ·sn
14:     πn = r0ᵀ·qn
15:     γn = f0ᵀ·sn
16:     ηn = f0ᵀ·tn
17:     θn = snᵀ·tn
18:     κn = tnᵀ·tn
19:     ωn = θn / κn
20:     σn = γn − ωn·ηn
21:     rn = sn − ωn·tn
22:     xn = xn−1 + zn + ωn·sn
23:     if xn is accurate enough then
24:         STOP
25:     end if
26:     un = A·rn
27: end for

Figure 2: Parallel BiCGSTAB Algorithm.
Figure 3: Speedup of BiCGSTAB. (Plot of relative speedup, 0–2, versus number of nodes, 0–20.)
class methods that define the range for the computation. Our selection was made based on the need to keep the BiCGSTAB algorithm highly reusable.
We developed an initial parallelized version of the code using OO-MPI [7] and ran it on our 32-node Beowulf Cluster (using only 24 nodes). Our initial performance results were somewhat surprising, only obtaining a 2X speedup on a 16-node configuration (see Figure 3).
To better understand the performance bottlenecks in this code, we used method-level profiling. The goal was to identify both the hot portions of the program (based on the number of method calls) and the methods where the program spends most of its execution time. Figure 4 shows the percentage of time spent in ILU while varying the number of compute nodes. We found that the preconditioner consumes a significant amount of the execution time, so we next parallelized the ILU preconditioner code.
4.4 ILU Preconditioner
Preconditioners are used to accelerate the convergence of an algorithm. The basic idea is to find, for a linear system Ax = b, a preconditioner M that is a good approximation to A, so that the preconditioned system is much easier to solve than the original system. Mathematically, the system M⁻¹Ax = M⁻¹b can then be solved by any iterative method. The preconditioner is not always easy to find, and there are many techniques used to arrive at the best preconditioner. Incomplete factorization preconditioners are commonly used since they are suitable for any type of matrix. The pseudocode of the ILU algorithm is given in Figure 5.
4.5 Parallel ILU Preconditioner
There are a number of parallel algorithms that implement the ILU method. However, it is hard to find a parallel ILU algorithm that is scalable. Most ILU methods are sequential in nature and thus difficult to parallelize on a cluster. In this paper we utilize the block ILU algorithm described in [5].
Figure 4: Profiling of the ILU Preconditioner Code. (Plot of the percentage of time spent in ILU, up to 45%, for 1, 2, 4, 8, and 16 nodes.)
Algorithm 4.1: ILU(M, x, D)
    for i ← 1 to n
        for j ← 1 to i
            v = v + A(i, j) * r(i)
        v = x(i) − v
    for i ← n − 1 to 1
        for j ← n − 1 to i
            v = v + A(i, j) * w(i)
        v = r(i) − v
        v = v * D(i)
        w(i) = v

Figure 5: The ILU Algorithm
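The loop nests of Figure 5 correspond to a forward sweep followed by a backward sweep. As an illustrative sketch (not the Toolbox's actual code), applying a factored preconditioner M = LU to a vector x, i.e. computing w = M⁻¹x, can be written as two triangular solves, here assuming L has a unit diagonal:

```cpp
#include <cassert>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Applies a factored preconditioner M = L*U to a vector x, i.e.
// computes w = M^{-1} x, via the two triangular sweeps sketched in
// Figure 5: a forward solve with unit-lower-triangular L, then a
// backward solve with upper-triangular U.
Vec apply_ilu(const Mat& L, const Mat& U, const Vec& x) {
    int n = static_cast<int>(x.size());
    Vec y(n), w(n);
    for (int i = 0; i < n; ++i) {          // forward sweep: L y = x
        double v = x[i];
        for (int j = 0; j < i; ++j) v -= L[i][j] * y[j];
        y[i] = v;                          // L has a unit diagonal
    }
    for (int i = n - 1; i >= 0; --i) {     // backward sweep: U w = y
        double v = y[i];
        for (int j = i + 1; j < n; ++j) v -= U[i][j] * w[j];
        w[i] = v / U[i][i];
    }
    return w;
}
```

Both sweeps carry a loop dependence along i, which is why ILU application is intrinsically sequential and motivates the block variant below.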
Figure 6: Block-ILU Algorithm. (Diagram: the matrix partitioned into four blocks assigned to processors P0–P3, with overlapped regions between neighboring blocks.)
The block ILU algorithm chosen splits the matrix into blocks, assigning each block to a processor. This technique involves no synchronization between processors, which makes it potentially scalable on clusters. The matrix is divided into overlapping blocks, and each block is stored on a single processor. Figure 6 illustrates this partitioning method: the original matrix is partitioned into 4 overlapping blocks, and each block is assigned to a processor. The size of the overlap region is a controllable parameter, which helps the user search for a better solution. This size is important: a large overlap region will significantly affect the performance of the code, while a small overlap will affect the quality and the convergence of the solution.
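A sketch of this partitioning (a hypothetical helper, not the library's interface) computes, for each processor, a half-open row range extended by a controllable overlap on each side:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch of the block partitioning of Figure 6: the row
// range [0, n) is split into nblocks contiguous blocks, each extended
// by `overlap` rows on both sides (clamped to the matrix), so that
// neighboring processors share an overlapped region and the factor-
// ization needs no inter-processor synchronization.
std::vector<std::pair<int, int>> block_ranges(int n, int nblocks, int overlap) {
    std::vector<std::pair<int, int>> ranges;
    int base = n / nblocks;
    for (int p = 0; p < nblocks; ++p) {
        int lo = p * base - overlap;                                 // extend left
        int hi = (p == nblocks - 1 ? n : (p + 1) * base) + overlap;  // extend right
        if (lo < 0) lo = 0;
        if (hi > n) hi = n;
        ranges.emplace_back(lo, hi);   // half-open [lo, hi)
    }
    return ranges;
}
```

For example, 16 rows over 4 processors with overlap 1 gives the ranges [0,5), [3,9), [7,13), [11,16); widening `overlap` trades extra redundant work per processor for a preconditioner closer to the sequential ILU.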
Now that we have described the parallel BiCGSTAB and parallel ILU preconditioner, we need to run our parallel implementation on our cluster. Runtime results are shown in Figure 7. The parallelized version runs much faster and obtains better scalability. The speedup is close to 4 on an 8-processor configuration (though it degrades when additional processors are used). This is not the result we were hoping for; we hoped for linear scalability. So we continued to profile the code, looking for additional opportunities for tuning.
We next focus on the communication overhead present in the code. To obtain this profile, we inserted trace points into the code before and after each synchronization point. We obtained a profile of the total communication time and of the communication time for each synchronization point.
Looking at the results in Figure 8, we can see that the communication time for 2, 4, 8, and 16 nodes is about 20%, 30%, 35%, and 44% of the total time, respectively. Obviously, we have scalability issues here. To address this problem, we need to hide some of this communication. One solution is to overlap communication with computation. However, in our case we are using a blocking Allreduce() method, and a non-blocking version of the Allreduce() method is not provided. By using a non-blocking version we expect the speedup to be closer to 8 on a 16-processor system.
Figure 7: Speedup of the BiCGSTAB using Block-ILU. (Plot of relative speedup, 0–4, versus number of nodes, 0–20.)
Figure 8: Communication profile for the code. (Plot of the percentage of communication time, 0–100%, for 1, 2, 4, 8, and 16 nodes.)
Figure 9: Original vs. Threaded version Speedup. (Plot of relative speedup, 0–4.5, versus number of nodes, 0–20, for the original and threaded versions.)
4.6 Threaded Communication
In our work, we have found scalability issues caused by the communication overhead. A straightforward approach to address this problem is to overlap communication and computation. To implement this solution, we needed to develop a number of non-blocking functions, specifically the Allreduce() and Allgather() functions. Rewriting the Allreduce() call using Isend and Irecv is not a good solution, since all processors would need to synchronize at each processing step (i.e., too much communication). Next, we describe our solution to this problem using threads.
Our solution is to spawn a communication thread on each processor that only handles MPI calls such as Allreduce() and Allgather(). Once the communication thread is started, it goes to sleep and waits for the main process to issue a wake-up signal. The main process wakes the communication thread, which then runs in parallel with the main process while the latter performs independent computation. Once the main process is done with the computation, it waits for the communication thread to finish communicating with the other processors. This is an efficient solution, since the communication thread will most likely be sleeping, waiting for a command from the main process or for an MPI call to complete. However, there is also overhead associated with thread creation and synchronization with the main process. If the application involves a significant amount of computation, our thread-based synchronization scheme will yield higher performance. On the other hand, if the application is communication-dominated, then the main process will be stalled waiting for communication to complete.
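The scheme can be sketched with a standard C++ thread and condition variable; the callable stands in for the MPI Allreduce()/Allgather() calls, and the class name and interface are our own illustration, not the actual implementation:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Sketch of the communication-thread scheme: the thread sleeps until
// the main process signals work, performs it in parallel with the main
// computation, and the main process later waits for completion.  The
// job callable stands in for MPI calls such as Allreduce()/Allgather().
class CommThread {
    std::mutex m_;
    std::condition_variable cv_;
    std::function<void()> job_;
    bool have_job_ = false, done_ = false, stop_ = false;
    std::thread worker_;
public:
    CommThread() : worker_([this] {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return have_job_ || stop_; }); // sleep
            if (stop_) return;
            auto job = job_;
            have_job_ = false;
            lk.unlock();
            job();                        // "communication" runs here
            lk.lock();
            done_ = true;
            cv_.notify_all();             // wake the waiting main process
        }
    }) {}
    void start(std::function<void()> job) {      // wake-up signal
        std::lock_guard<std::mutex> lk(m_);
        job_ = std::move(job);
        have_job_ = true;
        done_ = false;
        cv_.notify_all();
    }
    void wait() {                                // main process joins in
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return done_; });
    }
    ~CommThread() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }
};
```

Between `start()` and `wait()` the main process performs its independent computation, which is exactly the overlap window the text describes.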
As shown in Figure 9, we obtained improved speedup for the threaded version of the parallelized code, up to 8 nodes. However, when using 16 nodes, the execution is still dominated by communication. Since the problem is divided over more nodes, each node's share of the problem is smaller, so less time is spent on computation. Also, communication overhead grows as we add computational nodes, and additional overhead for thread synchronization results.
5 Discussion
It should be clear from our results that this algorithm is not scalable on the system used in our performance study. This cluster utilizes 100 megabit/second switched (i.e., Fast) Ethernet as its communication fabric, and it is clear that communication is the bottleneck for this algorithm. We are presently moving our work to a gigabit cluster system and expect to obtain better speedup. The latency of Gigabit Ethernet is very similar to that of Fast Ethernet; they differ only in bandwidth (there would be little difference in performance if the application only exchanged small messages in the absence of message collisions). In our case, the algorithm also needs to gather a vector on all processors, so we should expect a performance increase. To estimate the total communication overhead we developed the model given in Equation 2, where T is the total time, T0 is the latency, M is the message size in bytes, and r∞ is the bandwidth.
T = T0 + M / r∞    (2)
From the algorithm, we know the size of each message and the frequency of each MPI call, so we can estimate the total time due to communication overhead. In this estimate, we assume that Allgather() and Allreduce() are executed in 4 steps on a 16-node cluster. Applying these values to our model (Equation 2) gives us 34 sec for Fast Ethernet and 3.34 sec for Gigabit Ethernet. We can clearly see that the experimental results for Fast Ethernet are very close to the results obtained from our model. This suggests that the communication overhead when we move to a gigabit network will be one-tenth of the amount seen when using Fast Ethernet, which would provide a speedup of nearly 10 on a 16-node cluster. Since gigabit-connected clusters will shortly become the de-facto network fabric, our parallel library will provide users with an effective pathway to obtain scalable performance.
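The arithmetic behind Equation 2 is easy to mechanize: the helper below evaluates T = T0 + M/r∞ for one transfer and sums it over a number of calls. The parameter values in the usage are illustrative placeholders, not the measured quantities from the paper:

```cpp
#include <cassert>
#include <cmath>

// Equation 2: the time for one message transfer is the latency T0 plus
// the message size M (bytes) divided by the bandwidth r_inf (bytes/s).
double transfer_time(double t0_sec, double bytes, double bw_bytes_per_sec) {
    return t0_sec + bytes / bw_bytes_per_sec;
}

// Summing Equation 2 over all MPI calls estimates the total
// communication overhead of the algorithm.
double total_overhead(int ncalls, double t0_sec, double bytes,
                      double bw_bytes_per_sec) {
    return ncalls * transfer_time(t0_sec, bytes, bw_bytes_per_sec);
}
```

With large messages the M/r∞ term dominates T0, so moving from Fast Ethernet (~12.5 MB/s) to Gigabit Ethernet (~125 MB/s) shrinks the estimate by roughly a factor of ten, matching the 34 sec vs. 3.34 sec figures above.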
6 Summary
In this paper we presented the parallelization of an object-oriented library that implements the BiCGSTAB and ILU algorithms. This work is still in progress.
We have shown the importance of using threads to overlap communication and computation. Our initial results on Fast Ethernet showed that these algorithms are inherently dominated by communication. However, we have studied the amount of communication in these algorithms and predict that improved scalability will be obtained on a gigabit-connected cluster. Our next steps are to develop parallel versions of the CG, BiCG, GMRES, and QMR algorithms, and then integrate these into the parallelized IML++ library. We have shown that the parallel iterative method library provides both an object-oriented interface to speed code development and the ability to obtain reasonable parallel speedup on evolving network fabrics. We plan to explore using blocked versions of some of the algorithms, which initially appears to be a promising direction for our work. We also plan to improve the design, targeting increased flexibility and reusability.
7 Acknowledgements
This work is funded by CenSSIS, the Center for Subsurface Sensing and Imaging Systems, under the Engineering Research Centers Program of the NSF (Award Number EEC-9986821).
References
[1] M. Ashouei, D. Jiang, W. Meleis, D. Kaeli, M. El-Shenawee, E. Mizan, Y. Wang, C. Rappaport, and C. Dimarzio. Profile-Based Characterization and Tuning for Subsurface Sensing and Imaging Applications. Journal of Systems, Science and Technology, pages 40–55, 2002.

[2] R. Barrett, M. Berry, T.-F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia, PA, 1994.

[3] J. Dongarra, A. Lumsdaine, X. Niu, R. Pozo, and K. Remington. A Sparse Matrix Library in C++ for High Performance Architectures. In Proceedings of the 2nd Object Oriented Numerics Conference, pages 214–218, 1994.

[4] M. El-Shenawee, C. Rappaport, D. Jiang, W. Meleis, and D. Kaeli. Electromagnetics Computations Using the MPI Parallel Implementation of the Steepest Descent Fast Multipole Method (SDFMM). ACES Journal, 17(2):112–122, 2002.

[5] P. Gonzalez, J. C. Cabaleiro, and T. F. Pena. Parallel Incomplete LU Factorization as a Preconditioner for Krylov Subspace Methods. Parallel Processing Letters, 9(4):467–474, 1999.

[6] The Multi-View Tomography Toolbox. URL: http://www.censsis.neu.edu/mvt/mvt.html, Center for Subsurface Sensing and Imaging Systems, 2003.

[7] J. Squyres, B. McCandless, and A. Lumsdaine. Object Oriented MPI: A Class Library for the Message Passing Interface. In Proceedings of the POOMA Conference, 1996.

[8] Y. Wang and D. Kaeli. Profile-Guided I/O Partitioning. In Proceedings of the International Conference on Supercomputing, pages 252–260, 2003.

[9] L. T. Yang and R. P. Brent. The Improved BiCGStab Method for Large and Sparse Unsymmetric Linear Systems on Parallel Distributed Memory Architectures. In Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP-02), pages 324–328, Beijing, 2002.
A SUPER-PROGRAMMING TECHNIQUE FOR LARGE SPARSE MATRIX
MULTIPLICATION ON PC CLUSTERS
Dejiang Jin and Sotirios G. Ziavras
Department of Electrical and Computer Engineering
New Jersey Institute of Technology
Newark, NJ 07102
Abstract. The multiplication of large sparse matrices is a basic operation for many scientific and engineering
applications. There exist some high-performance library routines for this operation. They are often optimized
based on the target architecture. The PC cluster computing paradigm has recently emerged as a viable alternative
for high-performance, low-cost computing. In this paper, we apply our super-programming approach [24] to study
the load balance and runtime management overhead for implementing parallel large matrix multiplication on PC
clusters. For a parallel environment, it is essential to partition the entire operation into tasks and assign them to
individual processing elements. Most of the existing approaches partition the given sub-matrices based on some kind of workload estimation. For dense matrices on some architectures, estimations may be accurate. For sparse
matrices on PC clusters, however, the workloads of block operations may not necessarily depend on the size of the data. The
workloads may not be well estimated in advance. Any approach other than run-time dynamic partitioning may
degrade performance. Moreover, in a heterogeneous environment, static partitioning is NP-complete. For embedded problems, it also introduces management overhead. In this paper we adopt our super-programming
approach that partitions the entire task into medium-grain tasks that are implemented using super-instructions; the
workload of super-instructions is easy to estimate. These tasks are dynamically assigned to member computer
nodes. A node may execute more than one super-instruction. Our results prove the viability of our approach.
Keywords. PC cluster, matrix multiplication, load balancing, programming model, performance
evaluation.
1. Introduction

Matrix multiplication (MM) is an important linear algebra operation. A number of scientific and engineering
applications include this operation as a building block. Due to its fundamental importance, much effort has been devoted to studying and implementing MM [1-8]. MM has been included in several libraries [1, 6, 9-12, 14]. Some MM algorithms have been developed for parallel systems [2, 3, 5, 13, 15, 16]. In many applications, the matrices
MM algorithms have been developed for parallel systems [2, 3, 5, 13, 15, 16]. In many applications, the matrices
are sparse. Although all sparse matrices can be treated as dense matrices, the sparsity of matrices can be exploited
to gain high performance. For parallel MM, the entire task should be decomposed. This introduces various
overheads. The most important are the communication and load-balancing overheads. Partitioning matrices into sub-matrix blocks and decomposing the MM are often interdependent techniques [4, 13]. Applying special implementations for sparse matrix operations to sub-matrix blocks may improve performance, but it also makes the workloads of sub-matrix operations diverse. The information about the workload of a task that applies a sparse matrix operation to sub-matrix blocks may not be available at compile time, or even when subroutines are initialized; it is only available after these routines have executed. This increases the complexity of load balancing, so it is often ignored.
Most algorithms are optimized based on the characteristics of the target platform [2, 3, 5, 15, 16]. The PC cluster computing platform has recently emerged as a viable alternative for high-performance, low-cost computing. Generally, the PCs of a cluster have substantial resources that can be used simultaneously, but they have relatively weak communication capability. So far, they lack data-communication support as efficient as a supercomputer's; they only have communication channels implemented in software, and data communication still competes with computation for CPU resources. So we need a new programming model that better fits the PC-cluster architecture.
MM operations for sparse matrices are embedded in many host programs. The optimal matrix partitioning has to be performed at run time rather than at compile time, even if the MM algorithm adopts a "static strategy". Parameters must be optimized when the MM subroutine is launched, which introduces an additional management overhead. In heterogeneous environments, finding the optimal partitioning is an NP-complete problem [13]. How we can reduce this kind of impact must still be considered to find a high-performance solution for MM on PC clusters.
To address this problem, we use in this paper our recently introduced super-programming model (SPM) [24]. Using SPM, we analyze the performance of super-programs for MM, and discuss the effect of various strategies.
2. Super Program Development
2.1 Super Programming Model
In the super-programming model, the parallel system is modeled as a single virtual machine with multiple processing nodes (member computers) [24]. Application programs are modeled as super-programs (SPs) that consist of a collection of basic abstract tasks. These tasks are modeled as super-instructions (SIs). SIs are atomic workload units, and SPs are coded with SIs. SIs are dynamically assigned to available processing units at run time, thus rendering the task-assignment process simpler.
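The dynamic assignment can be sketched with a shared counter that workers drain at run time (a single-machine illustration with threads standing in for member computers; the names are hypothetical, not SPM's actual runtime interface):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical sketch of SPM's dynamic assignment: a shared counter of
// SI indices is drained by worker threads ("member computers").  Each
// worker claims the next unassigned SI at run time instead of receiving
// a static partition, so faster workers naturally take more SIs.
void run_super_program(int num_sis, int num_workers,
                       std::vector<int>& executed_by) {
    std::atomic<int> next{0};                 // index of next unassigned SI
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w)
        workers.emplace_back([&, w] {
            for (;;) {
                int si = next.fetch_add(1);   // claim one SI atomically
                if (si >= num_sis) return;    // no unassigned SIs remain
                executed_by[si] = w;          // "execute" the SI
            }
        });
    for (auto& t : workers) t.join();
}
```

Because assignment happens at claim time, no worker sits idle while unassigned SIs remain, which is the load-balancing property the model relies on.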
In SPM, the key elements are the super-instructions. They are high-level abstract operations that solve application problems. In assembly-language programming, each instruction in a program is indivisible and is assigned to a functional unit for execution. Similarly, in SPM each SI is also indivisible: an SI can only be assigned to and executed on a single node. At the PC executable level, SIs are routines implementing frequently used operations for the corresponding application domain. Each SI can
execute on any member computer in the PC cluster. In a heterogeneous environment, an SI has many different executables corresponding to the different computers.
The key difference of SPM compared to most other high-level programming models is that these super-instructions are designed in a workload-oriented rather than function-oriented manner. (More detailed comparisons follow in Section 4.) They have a limited rather than an arbitrary workload. An example of such an SI is "multiply a pair of matrices in which the number of elements is not more than k", where k is a design parameter. Programmers or compilers can expect that it can be implemented within a known maximum execution
time on a given type of processing unit. In this manner, the system can schedule the task efficiently, thus
minimizing the chance of member nodes being idle and reducing the overhead of parallel computing. When the degree of parallelism in the SP, measured in the number of super-instructions, is much larger than the number of nodes in the cluster, any node has little chance to be idle.
Super-instructions are basic atomic tasks with workloads that have a quite accurate upper bound. Application
programs usually consist of logic functions based on the adopted algorithm. Application programs developed
under SPM are composed of SIs. These programs can still group SIs for assignment to logical function units; we call such logical function units super-functions. A super-function achieves a logic operation with arbitrary workload, whereas an SI is workload-oriented: an SI may not achieve a complete logic operation but rather a part of a logic operation with a predefined maximum workload. In SPM, super-functions are implemented with SIs. A super-function with little workload may be completed in one SI. In the general case, however, based on the design parameters, a super-function cannot be completed in just one SI because of an SI's limited workload; a number of SIs may be required. Breaking a super-function into many SIs may increase the capability of the PC cluster to exploit the high-level parallelism in super-programs. As mentioned earlier, due to communication overheads, it is difficult for PC clusters to exploit low-level parallelism. Thus, to successfully achieve parallel computation, it is critical to sufficiently exploit high-level parallelism. Independently assigning SIs, rather than super-functions, provides more high-level parallelism. Extending the functional unit to handle multiple SIs means that the high-level parallelism is determined not only by the algorithm but also by the SI designer. Increasing the degree of high-level parallelism makes the task of balancing the workload easier.
To effectively support application portability among different computing platforms and to also balance the
workload in parallel systems, an effective instruction set architecture (ISA) should be developed for each
application domain under SPM. These operations then become part of the chosen instruction set. Application
programs should maximize the utilization of such instructions in the coding process. Then, as long as an efficient
implementation exists for each SI on given computing platforms, code portability is guaranteed and good load
balancing becomes more feasible by focusing on scheduling at this coarser SI level.
2.2 Super-Instruction Set
An effective ISA should be developed for each application domain. These super-instructions should be
frequently used operations for the corresponding application domain. Matrix multiplication is used as a basic
operation in many different applications, and the MM program is often embedded in these host programs. So the super-instruction set for MM should be a common subset of the ISAs for these application domains. Addition and multiplication of sub-matrix blocks are good candidates for super-instructions. Although matrix multiplication
inherently involves the addition and multiplication of elements of the operand matrices, it can also be achieved through the addition and multiplication of arbitrarily partitioned sub-matrices. In such a block-wise approach, the operands and products are all treated as much smaller matrices whose elements are themselves matrices. Because the computing workloads of multiplication and addition are determined by the size of the sub-matrix blocks, their workload can easily be limited by reducing the size of the blocks.
The sparsity in matrices can be exploited to reduce the workload of these operations and improve the performance of the entire program. When using specific algorithms, the workload of these operations depends on the sparsity of the sub-matrix block but is never more than that of a dense matrix.
For sparse matrices, generally, the non-zero elements may not be distributed in a uniform manner. This means the sub-matrix blocks may have different sparsity: some blocks may be sparser than the entire matrix, while others may be denser than the entire matrix or may not even have any zero element. Since the workload of multiplication implemented with an optimal algorithm for sparse matrices depends on the sparsity of the matrix, this diversity of sparsity across sub-matrix blocks makes the workload of sub-matrix multiplication vary substantially. We have chosen the following super-instructions:
1) denseToDenseMultiply(A, B, C): A is an n1 x n2 dense matrix; B is an n2 x n3 dense matrix; C is an n1 x n3 dense matrix. The functionality of this SI is C = C + A*B. n1, n2, and n3 are all not greater than a pre-determined value k1.

2) sparseTodenseMultiply(A, B, C): A is an n1 x n2 sparse matrix; B is an n2 x n3 dense matrix; C is an n1 x n3 dense matrix. The functionality of this SI is C = C + A*B. n1, n2, and n3 are all not greater than a pre-determined value k2.

3) sparseToSpareMultiply(A, B, C): A is an n1 x n2 sparse matrix; B is an n2 x n3 sparse matrix; C is an n1 x n3 potentially dense matrix. The functionality of this SI is C = C + A*B. n1, n2, and n3 are all not greater than a pre-determined value k3.
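A sketch of the first SI (dense-by-dense block multiply-accumulate) shows how the workload cap k1 bounds the work per invocation. The concrete cap value and the boolean return convention below are our own illustration, not part of the SPM specification:

```cpp
#include <cassert>
#include <vector>

using Block = std::vector<std::vector<double>>;

// Sketch of the denseToDenseMultiply super-instruction: C = C + A*B on
// dense sub-matrix blocks whose dimensions are capped by the design
// parameter k1, so the SI's workload has a known upper bound (O(k1^3)).
const int k1 = 64;  // illustrative cap; the real value is a design choice

bool denseToDenseMultiply(const Block& A, const Block& B, Block& C) {
    int n1 = static_cast<int>(A.size());      // A is n1 x n2
    int n2 = static_cast<int>(B.size());      // B is n2 x n3
    int n3 = static_cast<int>(B[0].size());
    if (n1 > k1 || n2 > k1 || n3 > k1) return false;  // exceeds SI bound
    for (int i = 0; i < n1; ++i)
        for (int k = 0; k < n2; ++k)
            for (int j = 0; j < n3; ++j)
                C[i][j] += A[i][k] * B[k][j];  // accumulate C += A*B
    return true;
}
```

The sparse variants would differ only in the block storage format and inner loops, while keeping the same capped-dimension contract.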
2.3 A Super-Program for Matrix Multiplication
We first discuss MM for distributed-memory parallel computers, focusing on an implementation for the ScaLAPACK library using 2D homogeneous grids [11]. We assume the problem C = A*B, where A, B, and C are N x N square matrices whose elements are same-size square sub-matrix blocks. We also assume that the computer system has n = p² nodes configured in a 2D square grid, and that N can be divided evenly by p so that N/p = q. The three matrices share the same layout over the 2D grid: each processor node Pr,s (0 ≤ r, s < p) holds all of the Ai,j, Bi,j, and Ci,j sub-matrix blocks where i = rq + i', j = sq + j', and 0 ≤ i', j' < q. Thus, any sub-matrix block Xi,j (for X = A, B, C) is deterministically held by node Pr,s, where r = i/q and s = j/q.
The MM algorithm then divides MM into N steps, where each step involves three phases. In the k-th step, where k = 1, 2, …, N, the three phases are:
1. Every node Pr,s that holds Ai,k (for all i = 0, 1, …, q−1) multicasts the matrix block horizontally to all nodes Pr,*, where "*" stands for any qualified number.

2. Every node Pr,s that holds Bk,j (for all j = 0, 1, …, q−1) multicasts the matrix block vertically to all nodes P*,s.

3. Every node Pr,s performs the update Ci,j = Ai,k * Bk,j + Ci,j, for all Ci,j held by the node.
Each node must have some temporary space to receive 2q matrix blocks.
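The block-to-node mapping just described (r = i/q and s = j/q, with integer division) can be written directly; the helper name is ours:

```cpp
#include <cassert>
#include <utility>

// Block-to-node mapping for the 2-D layout described above: with q
// blocks per node in each dimension, sub-matrix block X_{i,j} is held
// by node P_{r,s} with r = i/q and s = j/q (C++ integer division
// truncates, giving the required floor for non-negative indices).
std::pair<int, int> owner(int i, int j, int q) {
    return {i / q, j / q};
}
```

For example, with p = 2 and q = 4 (so N = 8), block (5, 2) lives on node P1,0. Because the mapping is deterministic, every node can locate any block's holder without a directory lookup.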
Let us now focus on the SPM implementation of MM. A super-program for MM in SPM consists of N3 SIs, namely Ii,j,k for updating Ci,j = Ai,k * Bk,j + Ci,j, for all i, j, k = 0, 1, …, N-1. If i ≠ i' or j ≠ j', then Ii,j,k and Ii',j',k' can be executed in parallel. If i = i' and j = j' but k ≠ k', however, Ii,j,k and Ii,j,k' can only be executed in sequence; since we lack a simple addition SI, they must also be executed on the same node. Under these constraints, any scheduling policy is acceptable.
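The dependence rule above amounts to a simple predicate. A sketch (the tuple representation of an SI is our assumption):

```python
# I_{i,j,k} and I_{i',j',k'} may run in parallel iff they update
# different blocks of C (i != i' or j != j'); SIs sharing (i, j) must
# run in sequence, and on the same node.

def can_run_in_parallel(si_a, si_b):
    (i, j, _k), (i2, j2, _k2) = si_a, si_b
    return i != i2 or j != j2
```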
3. Scheduling Strategies
Under the SPM model, super-programs can be implemented under various scheduling policies. Because various factors affect MM performance, we discuss the following scheduling policies.
3.1 Synchronous Policy
This policy makes the super-program for MM imitate an implementation for a distributed-memory parallel computer (DMPC). A DMPC consists of a number of nodes. Each node has its own local memory, and there is no global shared memory. Nodes communicate with each other via message passing. Computations and communications in a DMPC are globally synchronized into steps. A step is either a computation step or a communication step. In a computation step, each node performs a local arithmetic/logic operation or is idle. A computation step takes a constant amount of time for all nodes. If a node finishes its work before the time slot ends, it is idle for the remaining time.
Under this policy, we initially determine the mapping of sub-matrix blocks of the product matrix Ci,j to processing nodes. It is similar to the layout described above: each node is responsible for a part of the product matrix, but the assigned area need be neither rectangular nor contiguous. The scheduler schedules super-instructions based on the round and the mapping. In the k-th round, only the SIs Ii,j,k (for all i, j = 0, 1, …, N-1) are scheduled, so only the corresponding updates are performed in this round's computation step. When a node is available to execute another SI, the node is assigned an SI if and only if there is an unassigned SI Ii,j,k that updates a sub-matrix block mapped to the node. Otherwise, the node waits for the next round to begin.
Thus, super-instructions belonging to different rounds are never executed in parallel, while all SIs belonging to the same round are executed pseudo-synchronously; hence we call this the synchronous policy. Clearly, this scheduling policy makes the super-program implementation similar to the algorithm described in the previous section.
3.2 Static Scheduling Policy
As in the synchronous scheduling policy above, a static mapping of sub-matrix blocks of C onto processor nodes is determined statically, but the SIs are not assigned to the nodes in a synchronous manner. Once a node is available to execute another SI, the scheduler checks whether there exists a qualified candidate SI waiting for execution; the candidate must correspond to the update of a matrix block that belongs to this node. If so, no matter what round the SI belongs to, the scheduler assigns it to this node for execution. Otherwise, it leaves the node idle. Thus, a node only executes the SIs that compute sub-matrix blocks statically assigned to it. Across the nodes, these SIs are executed asynchronously.
In a more restricted (k-inner) variant, the scheduler first checks whether there exists a qualified candidate that updates the same matrix block as the SI previously run on the node (i.e., with the same i and j). If so, it schedules this SI; otherwise, it schedules any qualified SI. Thus a node computes the sub-matrix blocks of the product one by one.
3.3 Random (Dynamic) Scheduling Policy
In this policy, once a node is free, the scheduler first tries to find an unassigned SI that updates the same sub-matrix block as the previous SI run on this node (these SIs have the same i and j but different k). If it succeeds, it assigns the new SI to the node. Otherwise, it checks whether there exist sub-matrix blocks that have never been updated. If so, it randomly chooses a sub-matrix block from this set and assigns to the node an SI that updates this block.
Thus, any sub-matrix block can be assigned to any node. Once a block is assigned to a node, the node exclusively executes all the SIs that update the block. When a node needs more workload, the scheduler randomly chooses a sub-matrix block.
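The selection step of this policy might be sketched as follows. This is only an illustration; the dict/set data structures for nodes, SIs and blocks are our assumptions, and ties are broken deterministically for clarity:

```python
import random

def pick_si(node, unassigned, untouched_blocks, rng=random):
    """Pick the next SI for a free node under the random (dynamic) policy.

    node: dict whose "prev" key holds the (i, j, k) of the SI it last ran;
    unassigned: set of (i, j, k) SIs not yet assigned;
    untouched_blocks: set of (i, j) blocks of C never updated so far.
    """
    prev = node.get("prev")
    if prev is not None:
        i, j, _ = prev
        same_block = [si for si in unassigned if si[0] == i and si[1] == j]
        if same_block:                      # keep updating the same C block
            return min(same_block)
        # all SIs of that block are done; fall through to pick a new block
    if untouched_blocks:
        i, j = rng.choice(sorted(untouched_blocks))  # random fresh block
        return min(si for si in unassigned if si[0] == i and si[1] == j)
    return None                             # nothing left: node stays idle
```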
3.4 Smart (Dynamic) Scheduling with Random Seed Policy
This scheduling policy is similar to the Random (Dynamic) Scheduling Policy above. The only difference is the way a sub-matrix block is chosen. When a node needs more workload, the scheduler chooses a sub-matrix block for the node from all unassigned sub-matrix blocks based on a priority determined as follows:
1) For a node, an unassigned sub-matrix block Ci,j has high priority if there exist both a sub-matrix block Ci',j and a sub-matrix block Ci,j' among the blocks that have been assigned to the node.
2) For a node, an unassigned sub-matrix block Ci,j has medium priority if it does not have high priority and there exists either a sub-matrix block Ci',j or a sub-matrix block Ci,j' among the blocks that have been assigned to the node.
3) For a node, an unassigned sub-matrix block Ci,j has low priority if there exists neither a sub-matrix block Ci',j nor a sub-matrix block Ci,j' among the blocks that have been assigned to the node.
The scheduler always assigns a block with the highest priority to a node. If only low-priority blocks exist, a random choice is made.
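The three priority levels can be expressed as a small scoring function. A sketch (the set-based representation of a node's assigned blocks is our assumption):

```python
def priority(block, assigned):
    """Priority of an unassigned block C_{i,j} for a node: 2 = high,
    1 = medium, 0 = low, following the three rules above."""
    i, j = block
    same_column = any(j2 == j for (_i2, j2) in assigned)  # some C_{i',j}
    same_row = any(i2 == i for (i2, _j2) in assigned)     # some C_{i,j'}
    return int(same_column) + int(same_row)
```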
3.5 Smart (Dynamic) Scheduling with Static Seed Policy
This policy is very similar to the Smart Scheduling with Random Seed policy above, except that we now provide a scheme to choose the first block assigned to each node. When the scheduler initially assigns a sub-matrix block to a node, the list of blocks assigned to the node is empty and all unassigned candidate blocks have low priority. In this special case, the scheduler chooses a block based on a 2D grid arrangement, as in the static scheduling policy, rather than making a random choice.
4. Comparison With Other Parallel Programming Models
There exist two distinct popular models for parallel programming: the Message-Passing and Shared-Memory models [18, 22]. Generally, both programming models are independent of the specifics of the underlying machines. There exist tools to run programs developed under either model on any type of parallel architecture. For example, a language that follows the shared-memory model can be correctly implemented on both shared-memory architectures and share-nothing multi-computers; this is also true for a language that follows the message-passing model. For example, ZPL is an array programming language that follows the shared-memory model [23]; nevertheless, for distributed-memory implementations it employs either MPI or PVM for communications. Cilk also follows the shared-memory model [20].
The optimization of parallel application programs is done for specific architectures. For this reason, the Distributed-Shared-Memory model has been proposed; UPC follows this model [17]. It gives up the high-level abstraction of the shared-memory model and exposes the lower-level system architecture to programmers. It gives programmers the capability to exploit the distributed features of the architecture, thus increasing their responsibility for better program implementations.
An array programming language like ZPL exploits only data-level parallelism; the higher-level structure in programs is executed sequentially. POOMA is similar but follows an object-oriented style [25]: a library provides a few constructs to build logical objects and access them in a global-shared style, using data-parallel computing for high performance. To exploit higher-level parallelism, task-oriented approaches are used, where extracting the parallelism between program structures is a major task for the programmer; the runtime environment exploits this parallelism for task scheduling. A few multithreaded approaches have also been developed for mapping tasks to threads. Cilk is such an approach; it employs a C-based multithreaded language [20]. The programmer uses directives to declare the structure of the task tree. All sibling tasks are executed in parallel by multiple threads that share a common context in the shared memory. EARTH-C is also such an approach [19, 21].
Approaches that exploit data parallelism, like ZPL and POOMA, usually do not address load balancing explicitly; load balancing is done implicitly by the programmer through data partitioning. For task-oriented approaches, load balancing is normally performed at run time through a combination of assigning tasks to processors and, often, preempting and/or migrating tasks at run time, as deemed appropriate by the run-time environment. Cilk uses thread migration to balance the workload. Charm++ behaves similarly; it encapsulates working threads and relevant data into objects, and the system balances the workload by migrating these objects at run time.
Our SPM model employs load-control technology for load balancing. Only a limited number of SIs are initially assigned to processors, where each SI has a limited workload; all other SIs are left in an SI queue. When a processor completes the execution of an SI, the system automatically assigns it another SI. This mechanism guarantees that no processor is either overloaded or underloaded; thus, the workloads are balanced automatically among the processors. Our approach, similarly to Cilk, employs a greedy-utilization scheduler at run time. However, our scheduling approach is not based on processes and does not migrate threads, thus eliminating the overhead of task migration. On a shared-memory system, thread migration with Cilk has very low overhead (the implementers claim 2-6 times the overhead of a function call); however, this cost does exist. [20] suggests that a high-utilization schedule on a shared-memory system may be better than a maximum-utilization schedule. This means that if the difference in processor speeds is not large enough, migrating a thread from a slow processor to a faster processor is not better than keeping the thread on the original (slow) processor. On a shared-nothing system, the cost of thread migration is much higher than on a shared-memory system. In this case, a performance model that considers thread migration but ignores its cost cannot be expected to provide a very accurate prediction.
Generally, SPM is a non-preemptive, multi-threaded execution model. It was first tested with a data-mining
problem [24].
In the above-mentioned projects, the lifetime of threads is determined by the programmer or the compiler; in SPM, it is controlled by the run-time environment. The programmer only requests thread creation, and threads are created by the runtime system based on a load-control policy. Load-control technology has been used for many years to improve the performance of multitasking in centralized (single-processor) computer systems. Neither Cilk nor Charm++ uses any load control: when a function is invoked, the Cilk runtime environment creates a new thread and adds the load to a processor of the system no matter how busy the system is. These threads are similar to SPM SIs, but an SI under SPM is assigned to a processor only when enough resources are available. Thus, SIs never need to be migrated.
In all of these parallel programming models, the addressable data are treated similarly to data in the sequential programming model. In the message-passing model, an application uses a set of cooperating concurrent sequential processes; each process has its own private space and runs on a separate node. Except for message exchanges among them, the processes behave sequentially. In the shared-memory model, the programmer can access any non-hidden data; for example, in Cilk or UPC programmers can access all data as in ANSI C. In SPM, similarly to the shared-memory model, there is global data with a higher level of abstraction. The primary data are super-data blocks, which are pure logical objects in the object domain and are associated with a given application domain; the primary operations are carried out by super-instructions. These super-data may be mapped to large blocks of underlying data by the runtime system. But unlike in all the aforementioned models, programmers can operate on an object only as an atom, with an SI; they can never access its internal components. Thus, application programmers can only access these objects rather than lower-level block structures. This gives programmers a coarse-grain data space.
5. Performance Analysis
Let us begin with some definitions. The ideal (inherent) workload Wideal of a piece of a program is defined as the total number of ordinary instructions in its optimal sequential implementation. The ideal (inherent) performance Pideal of a computer is defined as the rate at which it executes ordinary instructions per second. The shortest execution time is T = Wideal / Pideal. The ideal (peak) performance of a cluster system with multiple computers is defined as the sum of the performance of all member computers. The (real) workload W of a parallel program run on a PC cluster is defined as the product of the overall execution time and the system (peak) performance (W = (∑Pi) * T). The workload W is definitely larger than the ideal workload Wideal; the difference between W and Wideal is the so-called overhead.
When a super-problem is implemented in parallel and the program runs on multiple nodes, the system needs to use not only its computing resources to solve the problem, but also its communication resources in order to exchange data among nodes. For example, when a super-program runs on a PC cluster, the system not only executes SIs on its member nodes but may also need to remotely load operands for the SIs; the latter increases the workload. This is the so-called communication overhead Wcomm. When the tasks distributed among the nodes are not balanced, some nodes finish their tasks before others, which means that these nodes are idle for some time. This increases the overall time for solving the problem; we call this the unbalance (or idle) overhead Widle. The scheduler also needs to carry out some computation to correctly schedule SIs; this introduces some management overhead Wman.
Thus, W = Wideal + Widle + Wcomm + Wman (4-0)
The relative overheads are widle = Widle / Wideal, wcomm = Wcomm / Wideal and wman = Wman / Wideal, respectively.
Below we build overhead models to analyze the unbalance overhead of SPM programs; we only briefly discuss the management overhead at the end of Section 5.
Under this model, a parallel program is modeled as a sequence of synchronous steps. Each step ends with a synchronization of the nodes, but no synchronization is needed inside a step. Within a step, all nodes execute as quickly as possible all tasks assigned to them in that step; then they wait for all the nodes to finish their assigned tasks. In this case, the unbalance overhead is the sum over all nodes of the idle time Ti-idle of node i multiplied by its peak performance Pi (i.e., Widle = ∑Pi * Ti-idle). In a homogeneous environment (Pi = P0), the unbalance overhead is Widle = P0 * ∑Ti-idle. For matrix multiplication, each sub-matrix can be computed independently; there is no inherent synchronization requirement, so super-programs for matrix multiplication can have only one synchronous step. A synchronous algorithm with a synchronous policy, however, may split the program into many steps requiring synchronization.
For the sake of simplicity and without loss of generality, we assume that both operands of the multiplication, A and B, are each partitioned into an N x N array of sub-matrix blocks; the product matrix C is also partitioned into an N x N array of sub-matrix blocks. Each sub-matrix block is an n1 x n2 matrix. We assume that the workload of an SI, which multiplies a pair of sub-matrices Aik * Bkj and updates the block Cij, conforms to a statistical rule R0 with average workload L0 and deviation D0. Also, the program is executed on a PC cluster with n = p2 nodes.
For the synchronous algorithm, in each step a node carries out (N/p)*(N/p) multiplications, each applied to a sub-matrix of C. Thus, in each step the expected workload of a node is
Lsyn = (N/p)*(N/p)* L0 (4-1)
and the deviation is Dsyn = (N/p)* D0. We estimate the practical maximum workload as Lmax-syn = Lsyn + Dsyn. Then, the unbalance overhead of the system in a given step is (n-1)*(Lmax-syn - Lsyn) = (n-1)* Dsyn and the total unbalance overhead of the system for the N steps is
Widle-syn = N*(n-1)* Dsyn = (n-1)/n^(1/2) * (N^2 * D0) ≈ n^(1/2) * (N^2 * D0) (4-2)
For static scheduling, each node is responsible for computing (N/p)*(N/p) sub-matrices of C, and the computation of a sub-matrix of C involves N multiplications of pairs of sub-matrices Aik * Bkj. If N is large enough, then according to the law of large numbers the expected workload of a node is Ls = (N/p)*(N/p)*N* L0 and the deviation is Ds = (N/p)*N^(1/2)* D0 = N^(3/2)/n^(1/2) * D0. We estimate the maximum workload among the p2 nodes to be Ls-max = Ls + Ds. Then, the unbalance overhead of the system is
Widle-s = (n-1)*(Ls-max - Ls) = (n-1)* Ds = (n-1)/n^(1/2) * (N^(3/2) * D0) ≈ n^(1/2) * (N^(3/2) * D0) (4-3)
For any dynamic scheduling, we assume that the unit of assignment corresponds to computing a sub-matrix of C rather than performing a single Aik * Bkj. In this case, the expected workload of an assignment is L1 = N* L0 and the deviation is D1 = N^(1/2) * D0. When the last assignment is made to a node, all other nodes are doing some computation and have no chance to be idle. If we estimate the average workload remaining on the other nodes to be half of L1, then the expected value of the unbalance overhead is
Widle-dyn = (n-1)*1/2 * L1 = (n-1)*N/2 * L0 ≈ n/2 * N * L0. (4-4)
Comparing Widle-s with Widle-syn, we find Widle-syn = N^(1/2) * Widle-s. Thus, the synchronous algorithm always has N^(1/2) times larger unbalance overhead than an asynchronous algorithm with a static scheduling policy. We also have
Widle-s / Widle-dyn = 2 * (N/n)^(1/2) * (D0/L0) (4-5)
Thus, when the deviation of the basic operation (here, the multiplication of a pair of sub-matrices) is insignificant, static scheduling has less unbalance overhead. However, if the deviation of the basic operation is significant, then when N > (n/4)*(L0/D0)^2 we have Widle-s / Widle-dyn > 1, and static scheduling has heavier unbalance overhead than any dynamic scheduling.
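The estimates (4-2)-(4-4) can be evaluated numerically to compare the policies for concrete values of N, n, L0 and D0. The following sketch transcribes the formulas above (it models the estimates, not the simulator):

```python
from math import sqrt

def w_idle_synchronous(N, n, L0, D0):
    """Eq. (4-2): N steps, per-step deviation (N/p)*D0 with p = sqrt(n)."""
    return N * (n - 1) * (N / sqrt(n)) * D0

def w_idle_static(N, n, L0, D0):
    """Eq. (4-3): one asynchronous phase, deviation N^(3/2)/sqrt(n)*D0."""
    return (n - 1) / sqrt(n) * N ** 1.5 * D0

def w_idle_dynamic(N, n, L0, D0):
    """Eq. (4-4): roughly half an assignment (N*L0) left per other node."""
    return (n - 1) / 2 * N * L0
```

For instance, with N = 32, n = 16 and D0 = L0, the static/dynamic ratio (4-5) evaluates to 2*sqrt(2) ≈ 2.83, so dynamic scheduling wins; the synchronous/static ratio is sqrt(32) ≈ 5.66, matching the discussion above.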
6. Simulation
6.1 Simulation Program
Any SI is simulated as an abstract task with a workload that is determined statically. A member computer is simulated as an abstract "Node" object with a peak-performance parameter. Each node records its accumulated idle time, the time spent loading data, and the time spent executing SIs. It also keeps track of all tasks that have been assigned to it, the next task it will finish, and the time at which it will finish that task. All nodes are kept in a queue ordered by the time at which they will finish their next task. Scheduling algorithms are implemented rather than simulated. After nodes have been assigned their initial tasks, in each step the simulator finds the next task to finish, obtaining the time at which that task will finish and the node that executes it; the system time is updated to that finish time and published to all nodes; the scheduler schedules new tasks to the node; and the node is put back into the queue if it is still active. The simulator keeps running until all SIs have been executed and all nodes have become idle. In our simulation, the static schedule was created manually in advance. Except for some parameters that were determined at run time by the scheduler, all other simulation parameters were generated in advance based on their statistical distribution features; these include the workloads of the SIs. Except in the simulation of synchronous multicast, tasks are scheduled as groups of SIs that update the same sub-matrix block of C.
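The core of such an event-driven simulator can be sketched with a priority queue of node finish times. This is a simplified illustration only: it models just workloads and node speeds, with a trivial greedy scheduler, and omits the communication, caching and per-round bookkeeping of the real simulator:

```python
import heapq

def simulate(workloads, speeds):
    """Return (makespan, per-node busy time) for greedy dynamic scheduling.

    workloads: list of task workloads (work units);
    speeds: peak performance of each node (work units per time unit).
    """
    tasks = list(workloads)           # remaining SI workloads (simple queue)
    heap = []                         # (finish_time, node_id), min-heap
    busy = [0.0] * len(speeds)
    for nid, p in enumerate(speeds):  # hand out the initial tasks
        if tasks:
            w = tasks.pop(0)
            busy[nid] += w / p
            heapq.heappush(heap, (w / p, nid))
    makespan = 0.0
    while heap:
        t, nid = heapq.heappop(heap)  # next task to finish; advance the clock
        makespan = t
        if tasks:                     # scheduler hands the node a new SI
            w = tasks.pop(0)
            busy[nid] += w / speeds[nid]
            heapq.heappush(heap, (t + w / speeds[nid], nid))
    return makespan, busy
```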
6.2. Parameters Chosen
The average time to execute an SI is 10000 time units; this is a basic parameter. Some networking parameters are chosen based on a pilot program and others are chosen arbitrarily. We choose the size of all matrices to be 32 x 32 blocks, where all elements are sparse square sub-matrices of the same dimension. We assume that the workload of an SI is uniformly distributed and that the deviation of the maximum/minimum values from the average equals the average value itself; this means that the workload is distributed uniformly between 0 and 20000.
The system runs under the CPU-bound condition. For loading a sub-matrix block remotely, the sending node requires 250 time units and the receiver takes 250 units, so the total cost of loading the two operands of an SI remotely is 10% of its execution time. In our pilot program, we found that our PC cluster system takes about 20% of CPU time to handle high-level object communication when the networking resource is fully used. Under the CPU-bound condition, the networking resource may not be fully used, so we assume that the total percentage of time to load the operands of an SI remotely is 10%.
The number n of PCs in the cluster is chosen as 1, 2, 4, 8, 16, 32, 64, 128 and 256. In the homogeneous environment, the peak performance of every node is 1 unit. In the heterogeneous environment, the relative peak performance of a node with an odd index is 3 units and that of a node with an even index is 1 unit.
For the static strategy, the static layout of sub-matrix blocks is scheduled as shown in Table 1 for 4 nodes. The left column and the top row give the indices of sub-matrix blocks; the numbers in the table are the indices of the nodes to which the sub-matrix blocks are assigned. For the other cases with a different number n of nodes in the cluster (n = 1 to 256), similar schemes are used. For Smart scheduling with a Static seed (the SS strategy), the initial task assigned to each node is chosen to be the same as for the static strategy.
Table 1 Static schedule for a cluster with 4 nodes

(a) a homogeneous environment

        0-15   15-31
0-15      0      2
15-31     1      3

(b) a heterogeneous environment

        0-11   12-19   30-31
0-15      0      1       2
15-31     0      3       2
For multicasts, the membership of nodes in a multicast group is determined statically as follows for all strategies. For multicasting the sub-matrix blocks of matrix A, all nodes located in the same row of Table 1 form a group; for multicasting the sub-matrix blocks of matrix B, all nodes located in the same column of Table 1 form a group. When a node requests a data block, all nodes in the same group receive this data block.
In our simulation, the sub-matrix blocks of A and B are cached separately. The unit of cached data is a row of sub-matrix blocks of A or a column of sub-matrix blocks of B. The size of the local cache is the number of rows of A and columns of B cached; it is chosen between 1 and 32. A size of 32 means that every node can cache 32 rows of A and 32 columns of B, and can therefore fully cache all the data it needs. All simulations are repeated 16 times with different data sets; the results are the averages over these simulation runs.
6.3. Results of Simulations
Simulation results for different scheduling algorithms in a homogeneous environment are shown in Fig. 1. Syn represents the synchronous algorithm; S represents the static scheduling algorithm; R represents the random scheduling algorithm; SS represents the smart scheduling algorithm with a static initial scheme; SR represents the smart scheduling algorithm with a random initial scheme. Except for Syn, all of them try to use both unicast and multicast operations for data communications.
The results for different scheduling algorithms in a heterogeneous environment are shown in Fig. 2.
The execution times of the simulation program in various cases are shown in Fig. 3; they are an upper bound on the management overheads. u and m represent the programs that use unicast and multicast for communication, respectively; H and R represent the programs run in a homogeneous and a heterogeneous environment, respectively.
From Fig. 1, we find that the synchronous algorithm has much higher unbalance overhead, although synchronous algorithms are widely accepted in parallel implementations of MM. For high performance, it is preferable to asynchronously update the independent sub-matrix blocks when performing MM on a PC cluster. Thus, when the workloads for updating different sub-matrix blocks are diverse, super-programs in SPM are more appropriate for performing MM in the PC-cluster environment than programs written in the synchronous fashion for a DMPC.
Fig. 1 Comparison of the unbalance overhead of super-programs with different strategies in a homogeneous environment, for PC clusters with different numbers of nodes.
This set of results also shows that dynamic scheduling algorithms always have less unbalance overhead than static scheduling algorithms, no matter whether unicast or multicast is used for data communication. The
difference in the unbalance overhead between static and dynamic scheduling depends on the average number of tasks executed per node: the more tasks a node executes (i.e., the fewer nodes in the cluster), the larger the difference. When there are only 4-8 tasks per node (cases g and h), the difference is small; in the other cases, it is significant. Thus, for large problems, dynamic strategies have an advantage in reducing the unbalance overhead. The difference in the unbalance overhead among the different (dynamic) scheduling strategies,
Fig. 2 Comparison of the unbalance overhead of super-programs with different strategies in a heterogeneous environment, for PC clusters with different numbers of nodes.
Fig. 3 Comparison of the management overhead of super-programs with different strategies:
(a) static strategy; (b) random strategy; (c) smart strategy with static initials; (d) smart strategy with random initials.
however, is very small in a homogeneous environment.
Comparing the results for a heterogeneous environment shown in Fig. 2 with the corresponding results for a homogeneous environment shown in Fig. 1, we can see that the basic features are similar, but the unbalance overhead increases. This is because in a heterogeneous environment the nodes have different peak performance: the slower nodes have a higher chance of executing the last SI, while other nodes with higher peak performance may be idle at the same time, thus wasting more computation capability. The increase for the dynamic strategies is larger than that for the static strategy, which decreases the advantage of the dynamic strategies; the critical value increases to 32 tasks per node (Fig. 2(e)). When there are only 4-8 tasks per node (cases g and h), the static strategy becomes preferable. This indicates that in a heterogeneous environment the dynamic strategies are also favorable for reducing the unbalance overhead, but they are only helpful for large problems.
From Fig. 3, we see that the execution times of the simulation programs are all less than 400 ms; in many cases they are less than 100 ms. Since the scheduling algorithms were implemented rather than simulated, the management overhead for both static and dynamic scheduling is less than 400 ms for the multiplication of two 32x32 matrices. Considering that there are 32K SIs to be executed in multiplying two 32x32 matrices, and each SI may take a few ms, the relative management overhead of SPM is very low. The average time to execute an SI depends on the size of the sub-matrix blocks, which is a design parameter. If the average time is 4 ms, the maximum management overhead is only 0.3% of the total execution time; in many cases it is less than 0.1%. Compared to the unbalance overheads, it can be ignored. Comparing the figures for the different scheduling strategies, we find that the smart scheduling strategies have slightly higher management overhead than the random strategy and the static strategy, but the difference is small. The execution times for simulating super-programs in a heterogeneous environment do not differ much from those in a homogeneous environment in any case. Thus, it is appropriate to apply these strategies in both homogeneous and heterogeneous environments.
It should be noted that, for embedded problems using the static algorithm, finding the static scheme incurs some additional management overhead. The research in [13] proves that in a heterogeneous environment finding the optimal static scheme is itself a non-trivial, NP-complete problem. A dynamic approach does not have this additional overhead; all of its management overhead is included in the implementation of the scheduling.
7. Experimental Results on a PC Cluster
All the experiments were performed on a PC cluster with eight nodes; the random/dynamic scheduling approach was followed. Each dual-processor node is equipped with AMD Athlon processors running at 1.2 GHz. Each node has 1 GB of main memory, a 64 KB Level-1 cache and a 256 KB Level-2 cache. All the nodes are connected through a switch to form an Ethernet LAN; each link has 100 Mbps bandwidth. All the PCs run Red Hat 9.0 and have the same view of the file system by sharing files via an NFS file server; all data are obtained from this file server. We used a set of synthetic sparse matrices of size 8192x8192 containing 5% non-zero elements. These matrices were partitioned into 32x32 grids of sub-matrices of size 256x256. The non-zero elements were double-precision floating-point numbers between 1.0 and 2.0, generated randomly; the locations of the non-zero elements were also chosen randomly. The super-program was developed manually using the super-instructions described in Section 2. We implemented a simple runtime environment (i.e., a virtual machine) that supports our super-programming model. The runtime environment consists of a number of virtual nodes (each residing in a PC node); super-instructions are scheduled to run on these virtual nodes. The runtime environment also caches recently used super-data blocks. The runtime environment and the super-instructions were implemented in the Java language. A special function was added to the code to collect information at runtime about the execution of SIs and the utilization of individual PCs. The execution times for matrix multiplication were extracted from this collected timing information.
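Synthetic inputs of the kind described above could be generated along the following lines. This is a sketch only (the experiments used Java, and the helper names are ours); it is shown at a reduced size so it runs quickly:

```python
import random

def make_sparse(n, density, rng):
    """Return a dict {(i, j): value} with density*n*n random non-zeros,
    double-precision values uniform in [1.0, 2.0], random positions."""
    nnz = int(density * n * n)
    entries = {}
    while len(entries) < nnz:        # re-draw on collisions
        i, j = rng.randrange(n), rng.randrange(n)
        entries[(i, j)] = rng.uniform(1.0, 2.0)
    return entries

def partition(matrix, n, block):
    """Split a sparse dict into a grid of sparse blocks of size block x block."""
    grid = {}
    for (i, j), v in matrix.items():
        bi, bj = i // block, j // block
        grid.setdefault((bi, bj), {})[(i % block, j % block)] = v
    return grid

rng = random.Random(0)
A = make_sparse(512, 0.05, rng)      # the paper uses 8192x8192 with 5% non-zeros
blocks = partition(A, 512, 64)       # the paper uses 256x256 blocks (32x32 grid)
```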
To check how many threads were needed for the system to run under the CPU-bound condition, we first ran the program on just two PCs. Each PC was allowed to launch n threads to execute n SIs simultaneously, for n = 4, 8, 12 and 16. The results are shown in Fig. 4. The execution with 4 threads has a much higher execution time, while with 8 or more threads per node the total execution times are very close. Thus, 8 threads can hide the long latency of remotely loading super-data blocks (operands). To check the speedup of the program, we ran it with n nodes, for n = 1, 2, 3, 4, 5, 6 and 8. The results are shown in Fig. 5. When using 4 threads on each node, the system may run under the delay-bound condition. Even though the speedup in this case (4T) is better than in the case (8T) where 8 threads are allowed on each node, the total performance in the former case is worse than in the latter.
Fig. 4. The effect of the number of threads per node on the performance of the super-program
Fig. 5. The execution time (a) and speedup (b) for different numbers of nodes
8. Conclusions

1. Super-programs in SPM can exploit the high-level parallelism of problems by executing coarse/medium-grain
tasks in parallel. SPM is appropriate for the parallel implementation of various algorithms.
2. The overhead of performing sparse matrix multiplication in parallel involves not only communication
overheads but also imbalance overheads. For an embedded problem, it also involves management overhead.
If the workload of block operations varies, as for sparse matrix multiplication, algorithms with
fine-level synchronization have high imbalance overhead, whereas algorithms using coarse-grain tasks usually
have less. Thus, by exploiting high-level parallelism, super-programs can execute tasks
asynchronously, which avoids high imbalance overhead. Dynamic scheduling of super-programs can
reduce the imbalance overhead further: algorithms using dynamic scheduling policies have less
imbalance overhead than those using a static scheduling policy.
3. The management overhead of a dynamic algorithm is similar among the different scheduling strategies. There is
no additional management overhead for embedded problems, whether in a homogeneous or a heterogeneous
environment; all management overheads are included in the execution of the scheduler and are insignificant.
In a heterogeneous environment, the management overhead of a static strategy for embedded problems is
expected to be high. Thus, super-programs in SPM with dynamic strategies have a great advantage on a
heterogeneous PC cluster.
References

[1] I. S. Duff, M. A. Heroux, R. Pozo, An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum, ACM Trans. on Math. Software (TOMS), Vol. 28(2), pp. 239-267, Jun. 2002.
[2] K. Li, Scalable Parallel Matrix Multiplication on Distributed Memory Parallel Computers, Proc. Int. Parallel and Distributed Processing Symp. (IPDPS), pp. 307-314, 2000.
[3] T. I. Tokic, E. I. Milovanovic, N. M. Novakovic, I. Z. Milovanovic, M. K. Slojcev, Matrix Multiplication on Non-planar Systolic Arrays, 4th Int. Conf. on Telecommunications in Modern Satellite, Cable and Broadcasting Services, Vol. 2, pp. 514-517, 1999.
[4] K. Dackland, B. Kågström, Blocked Algorithms and Software for Reduction of a Regular Matrix Pair to Generalized Schur Form, ACM Trans. on Math. Software (TOMS), Vol. 25(4), pp. 425-454, Dec. 1999.
[5] J. Choi, A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers, High Performance Computing (HPC) Asia '97, pp. 224-229, May 1997.
[6] I. S. Duff, M. Marrone, G. Radicati, C. Vittoli, Level 3 Basic Linear Algebra Subprograms for Sparse Matrices: a User-Level Interface, ACM Trans. on Math. Software (TOMS), Vol. 23(3), pp. 379-401, Sept. 1997.
[7] A. E. Yagle, Fast Algorithms for Matrix Multiplication Using Pseudo-Number-Theoretic Transforms, IEEE Trans. on Signal Processing, Vol. 43(1), pp. 71-76, Jan. 1995.
[8] H.-J. Lee, J. A. B. Fortes, Toward Data Distribution Independent Parallel Matrix Multiplication, Proc. 9th Int. Parallel Processing Symp., pp. 436-440, Apr. 1995.
[9] T. Hopkins, Renovating the Collected Algorithms from ACM, ACM Trans. on Math. Software (TOMS), Vol. 28(1), pp. 59-74, Mar. 2002.
[10] B. Kågström, P. Ling, C. Van Loan, GEMM-based Level 3 BLAS: High-performance Model Implementations and Performance Evaluation Benchmark, ACM Trans. on Math. Software (TOMS), Vol. 24(3), pp. 268-302, Sept. 1998.
[11] J. J. Dongarra, L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, R. C. Whaley, ScaLAPACK User's Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
[12] N. J. Higham, Exploiting Fast Matrix Multiplication within the Level 3 BLAS, ACM Trans. on Math. Software (TOMS), Vol. 16(4), pp. 352-368, Dec. 1990.
[13] O. Beaumont, V. Boudet, F. Rastello, Y. Robert, Matrix Multiplication on Heterogeneous Platforms, IEEE Trans. on Parallel and Distributed Systems, Vol. 12(10), pp. 1033-1051, Oct. 2001.
[14] S. Huss-Lederman, E. M. Jacobson, A. Tsao, Comparison of Scalable Parallel Matrix Multiplication Libraries, Proc. Scalable Parallel Libraries Conf., pp. 142-149, Oct. 1993.
[15] K. Li, Y. Pan, S. Q. Zheng, Fast and Processor Efficient Parallel Matrix Multiplication Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System, IEEE Trans. on Parallel and Distributed Systems, Vol. 9(8), pp. 705-720, Aug. 1998.
[16] M.-J. Noh, Y. Kim, T.-D. Han, S.-D. Kim, S.-B. Yang, Matrix Multiplications on the Memory Based Processor Array, High Performance Computing on the Information Superhighway (HPC Asia '97), pp. 377-382, May 1997.
[17] T. El-Ghazawi, F. Cantonnet, UPC Performance and Potential: A NPB Experimental Study, Proc. 2002 ACM/IEEE Conf. on Supercomputing (SC'02), pp. 1-26, Nov. 2002.
[18] R. Jin, G. Agrawal, Performance Prediction for Random Write Reductions: A Case Study in Modeling Shared Memory Programs, Proc. ACM SIGMETRICS Int. Conf. on Measurement and Modeling of Computer Systems, pp. 117-128, 2002.
[19] G. M. Zoppetti, G. Agrawal, L. Pollock, J. N. Amaral, X. Tang, G. Gao, Automatic Compiler Techniques for Thread Coarsening for Multithreaded Architectures, Proc. 14th Int. Conf. on Supercomputing, pp. 306-315, May 2000.
[20] M. A. Bender, M. O. Rabin, Scheduling Cilk Multithreaded Parallel Programs on Processors of Different Speeds, Proc. 12th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 13-21, 2000.
[21] A. Sodan, G. R. Gao, O. Maquelin, J. Schultz, X. Tian, Experiences with Non-numeric Applications on Multithreaded Architectures, Proc. 6th ACM SIGPLAN Symp. on Principles & Practice of Parallel Programming, Vol. 32(7).
[22] P. Mehra, C. H. Schulbach, J. C. Yan, A Comparison of Two Model-based Performance-prediction Techniques for Message-passing Parallel Programs, Proc. 1994 ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, Vol. 22(1), pp. 181-190, 1994.
[23] C. Lin, L. Snyder, ZPL: An Array Sublanguage, Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter, A. Nicolau, D. Padua, eds., pp. 96-114, 1993.
[24] D. Jin, S. G. Ziavras, Load-Balancing on PC Clusters With the Super-Programming Model, Workshop on Compile/Runtime Techniques for Parallel Computing (ICPP'03), Kaohsiung, Taiwan, Oct. 6-9, 2003 (to appear).
[25] S. Atlas, S. Banerjee, J. Cummings, P. J. Hinker, M. Srikant, J. V. W. Reynders, M. Tholburn, POOMA: A High Performance Distributed Simulation Environment for Scientific Applications, Supercomputing '95, 1995.
Session 3: Scientific and Engineering Applications
A Scene Partitioning Method for Global
Illumination Simulations on Clusters
M. Amor, J. R. Sanjurjo, E. J. Padron, A. P. Guerra and R. Doallo
August 30, 2003
Abstract
Global illumination simulation is highly critical for realistic image synthesis; without it the rendered images look flat and synthetic. The main problem is that the simulation for large scenes is a highly time-consuming process. In this paper we present a uniform partitioning method in order to split the scene domain into a set of disjoint subspaces which are distributed among the processors of a distributed memory system. Since polygons partially inside several subspaces have to be clipped, we have developed a new clipping algorithm to manage this situation. Memory and communication requirements have been minimized in the parallel implementation. Finally, in order to evaluate the proposed method we have used a progressive radiosity algorithm, running on two different PC clusters with modern processors and network technologies as a testbed. Good results in terms of speedup have been obtained.
keywords:
Computer Graphics, Global Illumination Method, Progressive Radiosity, Cluster Computing, Scene Partitioning
1 Introduction
Achieving a good partitioning of data is a critical issue to address in order to get efficient execution times for scientific simulations in high performance parallel programs. This task becomes even more important in distributed memory systems. In such platforms the effective distribution of data among processor memories is very important in avoiding data replication (memory constraint), minimizing communication among nodes (communication constraint) and achieving a good load balance (balance constraint).
Such a partitioning is commonly found by solving a graph partitioning problem [14]. It is thus a question of splitting the vertices of the graph into p disjoint subsets (assuming p processors), trying to balance the vertex weight across the subsets while minimizing the sum of the weights of the cut edges connecting vertices in different subsets.
Good examples of highly time-consuming scientific simulations can be found in the field of Computer Graphics, where the quest for realistic synthetic images is leading to the use of more and more complex and computationally expensive global illumination models. Nowadays, one of the most popular ones is radiosity, which, based on the real behavior of light in a closed environment, produces high-quality images. However, even the most advanced and optimized algorithms for the radiosity computation, such as progressive radiosity [3] or hierarchical radiosity [8], still have a significant computational cost and high memory requirements. Therefore, decreasing the execution time consumed when rendering large and complex scenes is needed.
In recent years several works on the parallelization of the radiosity method have been proposed for different architectures [1, 6]; the most recent works focus on distributed memory systems [13, 15, 5, 11, 10], either multicomputers or clusters, because of the good scalability and flexibility of these systems. As stated previously, it is in these kinds of machines that a good partitioning and mapping of input data onto the processors are vitally important.
In this work we have focused on an efficient partitioning algorithm using a clipping method which we have developed. We have applied this algorithm to the parallel implementation of a progressive radiosity method on distributed memory machines. The resulting method works with general polygon scenes and intends to solve some of the main problems that the existing implementations present: they were not suited for all kinds of scenes [13], or they did not use a very accurate visibility algorithm [15]. It should be remarked that the progressive radiosity method was used in this paper as a testbed for checking the benefits of our partitioning algorithm for 3D image synthesis; however, we think that similar results could be obtained using all those algorithms which work better with scenes with high visibility between closer elements and low visibility between farther ones.
The paper is organized as follows: in Section 2 we describe the partitioning method; in Section 3 we explain the clipping method used to split polygons spanning two or more subspaces; in Section 4 we present the parallel progressive radiosity algorithm resulting from the application of our partitioning method; experimental results on two different clusters are shown in Section 5 and, finally, in Section 6 some conclusions are highlighted.
2 Scene partitioning method
Among the broad range of existing partitioning techniques, we have focused on parallel static graph partitioning algorithms, since the progressive radiosity method (unlike, for instance, the hierarchical radiosity method) does not use an evolving data structure; this way it is not necessary to apply more complex dynamic graph partitioning methods. Within static techniques we have chosen geometric techniques, as we are dealing with geometric scenes and these techniques get an extremely fast partition based mainly on geometric information. Specifically, the applied algorithm is a variant of the Coordinate Nested Dissection (CND) algorithm [14], a geometric technique that attempts to minimize the boundary between the subspaces (and that way the inter-processor communications) by splitting the mesh in two halves across its longest dimension.
We have used a CND version that makes a parallel and geometric uniform partition of the input scene. It splits the scene domain into a set of disjoint subspaces of the same shape and size, and it assigns all the objects contained in a subspace to the same processing element. This kind of partitioning is computed quickly and straightforwardly, and it is convenient for using the visibility masks technique [1]. It also provides high data locality, minimizing inter-processor communication. However, this kind of partitioning tends to result in unbalanced partitions, which represents a penalty in the total execution time of the parallel algorithm.
Our implementation takes into account the following considerations:
• There are as many subspaces as processing elements and, because of the uniform partition, they must be equal in size. This means that for an odd number of processors and, therefore, of subspaces, we cannot divide a scene in two halves and then, recursively, split one of the children in two new halves (see Figure 1(a)), because the resulting subspaces would not have equal size. So the implemented algorithm is not recursive and the number of splitting planes needed to split the scene along each coordinate axis is computed in advance (see Figure 1(b)).
• Because the partitioning is uniform, the planes that divide the mesh along the same coordinate axis must be parallel and equally spaced. Furthermore, each dividing plane must split the whole domain, not only an existing subspace as in CND (see Figure 1(b)).

Figure 1: Recursive vs. a priori partitioning: (a) Recursive partitioning in three subspaces. (b) A priori partitioning in three subspaces.
The partitioning algorithm has two stages. In the first stage, which we call subspace partitioning, the input coarse scene is replicated in every processor; this fact does not represent a drawback due to the low storage requirements of the coarse scene. Next, every processor divides the scene domain, determines its subspace and computes its spatial boundaries. In the second stage, which we call subspace distribution, scene objects are distributed among processors according to the prior subspace assignment. Once the distribution is set up, each processor has a disjoint set of polygons and only performs the meshing of these polygons.
2.1 Subspace partitioning
Each processor reads the whole scene data and computes the bounding volume containing the entire scene. Next, each processor divides this volume into p uniform disjoint subspaces, where p is the number of processors. The number of voxels per dimension (nx, ny and nz) is computed from the prime factors di of p, p = d1 · d2 · ... · dm, with p being an even number. To avoid long subspaces, which could happen if the number of divisions in one dimension were much greater than in the others, the prime factors are assigned to the dimensions in the following way: initially, nx = ny = nz = 1; the factors are arranged in descending order; then the factor list is traversed and each factor di is assigned to whichever of nx, ny, nz has the lowest value. As an example, if we work with 36 processors, the sorted list of prime factors
Figure 2: Three possible cases with a polygon inside a subspace.
would be 3, 3, 2, 2 and the voxels per dimension would be nx = 3, ny = 3 and nz = 2 × 2 = 4.
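The factor-assignment procedure above can be sketched as follows (a minimal reconstruction; the function name and the factorization code are ours):

```python
def voxel_dims(p):
    # Compute the prime factors of p.
    factors, d, n = [], 2, p
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)

    # Greedily assign the factors, largest first, to whichever of
    # nx, ny, nz currently has the lowest value (Section 2.1).
    dims = [1, 1, 1]  # nx, ny, nz
    for f in sorted(factors, reverse=True):
        dims[dims.index(min(dims))] *= f
    return tuple(dims)
```

For p = 36 this reproduces the example in the text: `voxel_dims(36)` yields (3, 3, 4).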
Subspaces are assigned to processors according to their coordinates. During the execution of the parallel algorithm, processors exchange data using a protocol that will be presented in Section 4. In order to simplify the communication process, subspaces that are geometrically adjacent are assigned to neighboring processors. The assignment of subspaces is carried out as follows: the subspace with coordinates (cx, cy, cz) is assigned to the processor identified as cx + cz · nx + cy · nx · nz.
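The mapping can be written directly from the formula (a sketch; the helper name is ours):

```python
def subspace_to_rank(cx, cy, cz, nx, nz):
    # Processor id for subspace (cx, cy, cz), per Section 2.1;
    # geometrically adjacent subspaces get neighbouring ids.
    return cx + cz * nx + cy * nx * nz
```

For example, with nx = 3, nz = 2 (and ny = 2), every subspace of the 3 × 2 × 2 grid maps to a distinct id in 0..11.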
2.2 Subspace distribution
The distribution of the scene is executed, like the partitioning, in parallel by all the processing elements. First, polygons are classified based on their geometric position as totally inside, partially inside or totally outside. A polygon is totally inside a subspace if all of its vertices are inside the subspace (Figure 2(a)). A polygon is partially inside if one or more of its vertices are outside and at least one of the following conditions is true: some vertex is inside the subspace, some edge of the subspace intersects the polygon (Figure 2(b)) or, finally, some edge of the polygon intersects some face of the subspace (Figure 2(c)). Otherwise the polygon is totally outside.
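The vertex part of this classification can be sketched as below (our own illustrative code; the full test also needs the edge/face intersection checks described above):

```python
def classify_by_vertices(polygon, lo, hi):
    # Count vertices of the polygon inside the axis-aligned
    # subspace [lo, hi] (closed on all faces).
    inside = sum(
        all(lo[k] <= v[k] <= hi[k] for k in range(3))
        for v in polygon
    )
    if inside == len(polygon):
        return "totally inside"
    if inside > 0:
        return "partially inside"
    # No vertex inside: the polygon may still be partially inside
    # via the intersection tests of Figures 2(b) and 2(c).
    return "undecided"
```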
Then, polygons are processed according to the previous classification: polygons totally inside the subspace are added to the list of polygons of the subspace; polygons partially inside the subspace are clipped, adding to the list of polygons of the subspace the part inside and discarding the rest. Finally, if the polygon does not belong to the subspace, it is totally discarded and not added to the list.

Figure 3: Two instances of polygonal clipping against a subspace.
3 Clipping method
Clipping a polygon against a domain consists of cutting out from the polygon those pieces that are outside the frontiers of the domain. In this section we present a 3D extension of the Liang-Barsky algorithm [9], which clips polygons against rectangles. Four different cases must be considered:
• Interior vertices: vertices of the original polygon that are inside the subspace.

• Entering intersections: if the vertex vi is outside the subspace and the vertex vi+1 is inside the subspace, then the edge vi − vi+1 enters the subspace and produces an entering intersection in one of its faces.

• Exiting intersections: if the vertex vi is inside and the vertex vi+1 is outside, then the corresponding edge exits the subspace and produces an exiting intersection.

• Edge intersections: those produced by the intersection of one edge of the subspace with the polygon.
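For a single edge, the entering and exiting intersections can be found with a slab test in the spirit of Liang-Barsky (an illustrative sketch, not the paper's exact implementation):

```python
def edge_box_intersections(v0, v1, lo, hi, eps=1e-12):
    # Clip the segment v0 -> v1 against the axis-aligned box
    # [lo, hi]; returns (entering point, exiting point), or None
    # if the segment misses the box entirely.
    t_enter, t_exit = 0.0, 1.0
    for k in range(3):
        d = v1[k] - v0[k]
        if abs(d) < eps:
            # Segment parallel to this slab: reject if outside it.
            if v0[k] < lo[k] or v0[k] > hi[k]:
                return None
            continue
        t0, t1 = (lo[k] - v0[k]) / d, (hi[k] - v0[k]) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter, t_exit = max(t_enter, t0), min(t_exit, t1)
        if t_enter > t_exit:
            return None

    def point(t):
        return tuple(v0[k] + t * (v1[k] - v0[k]) for k in range(3))

    return point(t_enter), point(t_exit)
```

If t_enter stays at 0 (or t_exit at 1), the corresponding endpoint is already inside the box, i.e. it is an interior vertex rather than an entering or exiting intersection.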
Figure 3 shows the four types of vertices that can arise as the result of polygon clipping. In this figure vertex v3 is inside the subspace and so it is an interior vertex; the following vertex, v0, is outside the subspace, so the edge v3 − v0 crosses the subspace boundary at is0, producing an exiting intersection. The edges v0 − v1 and v1 − v2 do not create intersections because all of their vertices are exterior vertices. The edge v2 − v3 crosses the subspace boundary, making the entering intersection ie0. Since the exiting intersection is0 and the entering intersection ie0 belong to different faces of the subspace, there must be an edge intersection between them, named ia0.
In addition to taking into account the above considerations, every time a new clipping vertex is inserted in the clipping list the immediately previous and following vertices in the list must be checked to verify that they fulfill the conditions that their position implies. For that purpose we create a data structure that stores all the information concerning a vertex, as can be seen in Table 1. The field named type indicates the type of the corresponding element in the list: it can be an interior vertex, an entering intersection vertex, an exiting intersection vertex, or an edge intersection. The fields named face1 and face2 indicate the faces associated with the vertex. If the vertex is associated with just one face, the field face2 is left blank. If the vertex is interior, both face1 and face2 are left blank. The field link face stores the face which is shared with the following vertex in the list.
Vertex
Type
Associated face 1
Associated face 2
Link face

Table 1: Element of the clipping list.
3.1 Clipping algorithm
Our method consists of three stages in which the vertices involved in the clipping of a polygon are computed and stored in an ordered list. In the first stage, interior vertices and entering and exiting intersections are computed. In the second stage edge intersections are computed, and in the last stage the resulting clipped polygon is tessellated. The insertion of vertices at the correct position in the clipping list is a task that needs further processing.
In the first stage of the clipping method we traverse the list of vertices of the polygon and for each vertex vi the following tasks are carried out:

• If the vertex is inside the subspace then it is stored in the first free position in the clipping list.
Figure 4: Result of the first and second stages of the clipping method: 4(a) Clipping example. 4(b) Clipping list after the first stage. 4(c) Swapping in the auxiliary list (2nd stage of the clipping method).
• We check whether any of the six faces of the subspace intersects the edge vi − vi+1. An edge intersects at most two faces of the subspace, in which case it produces an entering and an exiting intersection. If any intersection is coincident with vi or vi+1 it is discarded. When there are two intersections, the entering one is inserted in the first free position of the clipping list, and the exiting intersection is inserted in the next position. If there is just one intersection, it is stored in the first free position in the list. For every intersection added to the list, the fields face1 and link face are initialized with the intersected face of the subspace.
Figure 4(a) shows the clipping of a polygon used to illustrate the method. The procedure begins at the vertex v0 which, as it is an interior vertex, is inserted in the first position of the list. Then, the edge v0 − v1 exits the subspace and intersects the left face, which will be set as its face1 and link face, generating the exiting intersection is0, which is inserted in the list. Vertex v1 is outside the subspace and the edge v1 − v2 does not intersect the subspace. Finally, v2 is outside the subspace, and the edge v2 − v0 enters the subspace through the right face, generating the entering intersection ie0. This intersection is inserted in the list, with the right face of the subspace as its face1 and link face. Figure 4(b) shows the clipping list at the end of the first stage.

In the second stage every edge of the subspace is tested for intersection with the polygon. If an intersection exists, it is stored in a secondary list similar to the clipping list. Then, for every element of the secondary list the following steps take place:
• The intersection of the edge is compared with the points in the clipping list to check if they match. This could happen if some vertex of the polygon belongs to one of the edges of the subspace, or if an edge of the polygon lies in the same plane as a face of the subspace (thus intersecting at the same time an edge and a face of the subspace). In these cases the intersection is discarded.

• If the intersection is not discarded, we search the list to find the position where the insertion should take place. For each element of the clipping list its field link face is tested to check if it matches any of the fields face1 or face2 of the secondary list element. In that case this edge intersection should be inserted in the clipping list right after the current intersection.

• If no position can be determined for inserting this edge intersection in the clipping list, it has to follow another edge intersection still in the secondary list. The element of the secondary list being processed is swapped with another element of the secondary list placed n positions forward. Initially, and every time an edge intersection is removed from the secondary list and inserted in the clipping list, we set n = 1. Otherwise n is incremented by one and the process is repeated with the element resulting from the swap.

• Finally, the edge intersection is inserted in the clipping list, moving the following elements one position forward. Besides, the field link face of the edge intersection is updated with the associated face different from the previous link face.
The left table in Figure 5 shows the secondary list with the edge intersections for the example in Figure 4. The order of the edge intersections in this list depends on the order in which the edges of the subspace are traversed. The right table shows the list after the swapping has been done.
In the third stage, if the resulting polygon after clipping is not a triangle, it is decomposed into triangular polygons (tessellation).
Figure 5: Insertion of an intersection during the second stage of the clipping method.
4 Parallel Algorithm
Our implementation is based on coarse-grained parallelism, so every processor computes the radiosity solution for a set of patches of the scene. The message-passing library MPI is used to manage interprocessor communication. The different phases of the parallel algorithm are described in detail in the next subsections.
4.1 Local Radiosity Computation
In each iteration every processor chooses the patch with the greatest energy to shoot, Pi (among all patches inside its local subspace and patches received from other processors/subspaces). The radiosity of the selected patch is shot and, as a result, the other patches of the local subspace, Pj, may receive some new radiosity, while the patch Pi has no energy left to shoot. Each shooting step can be viewed as multiplying the value of the radiosity to be shot by a column of the form factor matrix. Therefore, in each iteration the required form factors and the visibility values between the chosen patch and the rest of the patches of the local subspace are computed. In order to calculate the form factors we have used the analytic method called differential area to convex polygon [2]. On the other hand, for visibility determination, we have implemented an algorithm based on directional techniques [7, 12].
In addition to shooting the energy in the local subspace, the rest of the scene has to be updated too, so each patch is sent along with the local visibility information to other processors. This visibility information is encoded as a structure called a visibility mask [1, 13]. In our approach, we define frontiers as the faces separating two subspaces. A frontier is divided into n × n patches. Each of these patches has an associated boolean value. This value is 1 if the frontier patch is visible from the selected patch, and 0 otherwise (see Figure 6). Once the initial visibility mask is calculated, the processor sends it, attached to a copy of the patch, to the neighboring processors which share a common frontier with it.

Figure 6: Computation of the starting visibility mask.
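A visibility mask can be represented as a simple n × n boolean grid (a hypothetical sketch of the data structure; the merge-by-OR rule for combining two masks of the same frontier is our assumption):

```python
class VisibilityMask:
    # n x n grid of booleans over a frontier; bits[i][j] is 1 if
    # frontier patch (i, j) is visible from the shooting patch.
    def __init__(self, n):
        self.n = n
        self.bits = [[0] * n for _ in range(n)]

    def set_visible(self, i, j):
        self.bits[i][j] = 1

    def merge(self, other):
        # Assumed combination rule: a frontier patch is visible if
        # it is visible in either mask (logical OR).
        for i in range(self.n):
            for j in range(self.n):
                self.bits[i][j] |= other.bits[i][j]

    def visible_count(self):
        return sum(map(sum, self.bits))
```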
4.2 Communication stage
Each processor stores the patches received from other processors in a queue of pending patches: a patch stays in that queue while its visibility information is not complete. Once the visibility information for a received patch is complete, the patch is moved from the pending queue to a queue of unshot patches. Besides, this patch can now be sent to the neighboring processors.
In order to know when the visibility information is complete, and to avoid sending a patch several times to a processor, a flag is attached to each patch. This flag has two fields: one contains the information required to identify the processor which sends the patch and the other identifies that particular patch. From this information the processor determines the frontiers from which it is going to receive visibility masks for the patch, and the frontiers through which it is going to send the new masks updated with the local visibility information. Let us denote the coordinates of the subspace/processor containing the patch and the coordinates of this processor as (sx, sy, sz) and (cx, cy, cz), respectively. The analysis to determine whether a frontier is an input or an output frontier for a patch consists of applying the following expressions (for the x coordinate; the y and z cases are analogous):
F(cx + 1) = OUT · (sx ≤ cx) + IN · (sx > cx)
F(cx − 1) = OUT · (sx ≥ cx) + IN · (sx < cx)    (1)
Figure 7: Input and output frontiers for patch p and processors 5 and 8.
Figure 8: Intermediate mask.
where F(cx + 1) and F(cx − 1) are the frontiers between the subspace assigned to the processor under consideration, (cx, cy, cz), and the subspaces (cx + 1, cy, cz) and (cx − 1, cy, cz), respectively; OUT and IN indicate whether it is an output or an input frontier.
To clarify, let us consider the example of Figure 7, where a scene has been partitioned into 12 subspaces. This figure depicts the input frontiers for processors 5 and 8 regarding the patch P in processor 6. Processor 5 must receive P through frontiers 1, 4 and 5, and it has nothing to send. Processor 8 must receive P only once, through frontier 1, and has to send it through frontiers 0 and 2.
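Expression (1) translates directly into code; the sketch below (helper name ours) classifies the two frontiers of a subspace along one axis:

```python
def frontier_roles(s, c):
    # Eq. (1) for one coordinate axis: s is the coordinate of the
    # subspace containing the patch, c this processor's coordinate.
    # Returns the roles of the frontiers F(c+1) and F(c-1).
    f_plus = "OUT" if s <= c else "IN"    # F(c + 1)
    f_minus = "OUT" if s >= c else "IN"   # F(c - 1)
    return f_plus, f_minus
```

A patch originating in a subspace with a smaller coordinate therefore arrives through F(c − 1) and leaves through F(c + 1), so patches always flow away from their source subspace.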
The propagation of a patch to the neighboring subspaces is carried out once the patch and all the visibility information associated with it have been received through the input frontiers. If the subspace has no output frontiers, then the transmission of the patch information finishes. Otherwise, for every output frontier the intermediate mask is computed. The basic idea, outlined in Figure 8, consists of updating the visibility masks by casting rays from the received patch to the output frontier through the input frontiers. In the figure, the calculation of the intermediate visibility mask for the output frontier subspace 1 → subspace 2 regarding the patch A is depicted.
During this stage non-blocking communications are used to overlap computation and communication.
4.3 Convergence test
At the end of each iteration, after the communication stage, convergence is checked. In the first iteration, the patch with the greatest energy in the scene is chosen by means of a reduction operation (MPI_Allreduce); therefore this energy value is obtained in all processors. In the next iterations, each processor compares this value with the radiosity shot in that iteration. If the difference between them is greater than a given threshold and the queue of pending patches becomes empty, the processor enters an idle state; otherwise a new iteration of the progressive radiosity algorithm begins.
In the idle state, processors send messages to one another with the sum of the number of sent and received patches. A processor finishes the iterative process if this value is equal for all processors; otherwise, it remains in the idle state because some patch could still be received.
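The convergence check and the idle-state handshake above can be sketched in plain Python (function and variable names are ours; `initial_max_energy` stands for the value obtained through MPI_Allreduce in the first iteration — this is an illustrative sketch, not the paper's implementation):

```python
# Hypothetical sketch of the per-processor convergence test described above.
# Names are illustrative, not taken from the actual code.

def next_state(initial_max_energy, shot_radiosity, threshold, pending_patches):
    """Decide whether this processor keeps iterating or goes idle."""
    converged_locally = (initial_max_energy - shot_radiosity) > threshold
    if converged_locally and not pending_patches:
        return "idle"      # wait in case patches are still in flight
    return "iterate"       # start a new progressive-radiosity iteration

def all_done(patch_counts):
    """Global termination: every processor reports the same sum of
    sent + received patches, so no patch can still be in flight."""
    return len(set(patch_counts)) == 1
```

The global test compares the sent+received counters of all processors; equality means no patch can still be traveling between subspaces, so the iterative process can safely finish.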
5 Experimental Results
In this section, we present performance results and analysis of the proposed scene partitioning method. We have tested our method on two different clusters of PCs: a cluster with Pentium III 800 MHz processors connected through a 100 Mbps Fast Ethernet network (F. Ethernet Cluster), and a cluster with Intel XEON IV 1.8 GHz processors connected through a 250 Mbps SCI (Scalable Coherent Interface) network (SCI Cluster).
Three different scenes have been rendered using our method. Two of them are shown in Figure 9: room-model (Figure 9.a) and armadillo (Figure 9.b). The third scene, largeroom-model, is roughly the result of joining two room-model scenes. The scenes have 5306, 3999 and 7070 triangles, respectively, in the sequential case. The room-model has usually been used to evaluate illumination methods, and the armadillo was chosen to evaluate the performance of our method for non-uniform scenes.
Figure 9: Illuminated scenes. (a) room-model (b) armadillo.

The results in terms of speedup on both machines (F. Ethernet Cluster and SCI Cluster) are shown in Figure 10. All results have been measured with respect to the straightforward sequential algorithm, and show the low overhead associated with our partitioning method. The execution times for the sequential algorithm on the F. Ethernet Cluster and the SCI Cluster for the largeroom-model were 344.44 seconds and 142.88 seconds, respectively; the corresponding parallel execution times on 8 processors were 15.17 and 10.51 seconds. As these figures show, a super-linear speedup is obtained for these scenes. This is mainly due to two reasons: partitioning the scene greatly improves data locality, with a significant reduction in cache misses and better exploitation of the memory hierarchy in each processor; and the use of visibility masks introduces approximations in the visibility computation, greatly reducing the complexity of visibility determination with little loss of accuracy.
On the other hand, the parallel execution time increases with respect to the sequential time for scenes with a small number of triangles when only a few processors are used (see Figure 11 for 2 processors). The reason for this behavior is the overhead associated with our method: scene partitioning and visibility mask computation. Beyond 4 processors, the effect of this overhead is balanced by the advantages introduced by the parallelization. Note that the speedup obtained on the F. Ethernet cluster is higher than on the SCI cluster, because the cache miss rate in the latter is higher.
In summary, the performance of our method on the two different systems has been very good, even for scenes with different degrees of uniformity.
Figure 10: Speedup of the test scenes (largeroom-model, room-model and armadillo) versus the ideal speedup. (a) F. Ethernet cluster (b) SCI cluster.
Figure 11: Execution time for the armadillo on the F. Ethernet and SCI clusters.
6 Conclusions
Achieving a good environment subdivision plays a very important role in a large number of computer graphics applications. In this work we have presented a parallel partitioning of an input scene, achieving a very low-cost subdivision of the scene into an arbitrary number of subenvironments while avoiding the replication of data in memory. As a demonstration we have applied the algorithm to a parallel implementation of a progressive radiosity method, achieving a very flexible and general implementation, appropriate for heterogeneous systems.
Our parallel progressive radiosity algorithm maintains high quality in the illuminated scenes. Furthermore, our method achieved high speedups on the two different clusters of PCs, exploiting data locality and reducing and optimizing communication.
Acknowledgements
This work was supported in part by the Ministry of Science and Technology (MCYT) of Spain under project TIC 2001-3694-C02 and by the Xunta de Galicia under project PIDT01-PXI10501PR.
References
[1] B. Arnaldi, T. Priol, L. Renambot and X. Pueyo. Visibility Masks for Solving Complex Radiosity Computations on Multiprocessors. Proc. of the First Eurographics Workshop on Parallel Graphics and Visualization, 219–232, 1996.

[2] M. Cohen and J. R. Wallace. Radiosity and Realistic Image Synthesis. Academic Press Professional, 1993.

[3] M. Cohen, S. E. Chen, J. R. Wallace and D. P. Greenberg. A Progressive Refinement Approach to Fast Radiosity Image Generation. ACM Computer Graphics (Proc. of SIGGRAPH '88), 22(4):31–40, 1988.

[4] A. Fujimoto, T. Tanaka and K. Iwata. Arts: Accelerated Ray Tracing. IEEE Computer Graphics and Applications, 6(4):16–26, 1986.

[5] R. Garmann. Spatial Partitioning for Parallel Hierarchical Radiosity on Distributed Memory Architectures. Proc. of the Third Eurographics Workshop on Parallel Graphics and Visualization, 2000.

[6] S. Gibson and R. J. Hubbold. A Perceptually-Driven Parallel Algorithm for Efficient Radiosity Simulation. IEEE Transactions on Visualization and Computer Graphics, 6(3):220–235, 2000.

[7] E. A. Haines and J. R. Wallace. Shaft Culling for Efficient Ray-Traced Radiosity. In Proc. of Photorealistic Rendering in Computer Graphics (Second Eurographics Workshop on Rendering), 122–138, 1994.

[8] P. Hanrahan, D. Salzman and L. Aupperle. A Rapid Hierarchical Radiosity Algorithm. ACM Computer Graphics (Proc. of SIGGRAPH '91), 25(4):197–206, 1991.
[9] Y.-D. Liang and B. Barsky. An Analysis and Algorithm for Polygon Clipping. CACM, 26(11):868–877, 1983.

[10] D. Meneveaux and K. Bouatouch. Synchronization and Load Balancing for Parallel Hierarchical Radiosity of Complex Scenes on a Heterogeneous Computer Network. Computer Graphics Forum, 18(4):201–212, 1999.

[11] E. Padron, M. Amor, J. Tourino and R. Doallo. Hierarchical Radiosity on Multicomputers: a Load-Balanced Approach. Proc. of the Tenth SIAM Conference on Parallel Processing for Scientific Computing, 2001.

[12] E. J. Padron, M. Amor, J. R. Sanjurjo and R. Doallo. Comparative Study of Visibility Algorithms in Global Illumination Methods. Technical report (in Spanish), 2003.

[13] L. Renambot, B. Arnaldi, T. Priol and X. Pueyo. Towards Efficient Parallel Radiosity for DSM-based Parallel Computers Using Virtual Interfaces. Proc. of the Third Parallel Rendering Symposium (PRS '97), 79–86, 1997.

[14] K. Schloegel, G. Karypis and V. Kumar. CRPC Parallel Computing Handbook, chapter Graph Partitioning for High Performance Scientific Simulations. Morgan Kaufmann, 2000.

[15] O. Schmidt and L. Reeker. New Dynamic Load Balancing Strategy for Efficient Data-Parallel Radiosity Calculations. Proc. of Parallel and Distributed Processing Techniques and Applications (PDPTA '99), 1:532–538, 1999.
Efficient Scheduling for Low-Power High-Performance DSP Applications

Zili Shao, Qingfeng Zhuge, Youtao Zhang, Edwin H.-M. Sha
Department of Computer Science
University of Texas at Dallas, Richardson, Texas 75083, USA
{zxs015000, qfzhuge, zhangyt, edsha}@utdallas.edu
Abstract
To process high performance DSP applications in embedded systems, VLIW (Very Long Instruction Word) architectures are widely adopted in high-end DSP processors. While better performance is achieved, more power consumption may be needed in order to execute several instructions simultaneously in a VLIW processor. Therefore, an important problem is how to schedule an application on a VLIW processor such that the energy consumption of the application is minimized while the timing performance is satisfied.
To solve this problem, we identify the two most important factors influencing the energy consumption of an application executed on a VLIW processor: switching activity and schedule length. Considering these two factors together, we present a novel instruction-level energy-minimization scheduling technique to reduce the energy consumption of an application on a VLIW processor. We first formally prove that this problem is NP-complete. Then three heuristic algorithms, MSAS, MLMSA, and EMSA, are proposed and studied. While switching activity and schedule length are given higher priority in MSAS and MLMSA, respectively, EMSA gives the best result considering both of them. The experimental results show that EMSA gives a 31.7% reduction in energy compared with the standard list scheduling approach on average.
1 Introduction
In embedded systems, high performance digital signal processing (DSP) used in image processing, multimedia, wireless security, etc., needs to be processed not only with high data throughput but also with low power consumption. To satisfy the ever-growing performance requirement, a high-performance architecture with multiple functional units, such as VLIW (Very Long Instruction Word) architectures (for example, the TI TMS320C6K), is now commonly used. Since multiple functional units execute simultaneously in these architectures, power consumption becomes one of the most important issues to be considered alongside performance. Therefore, an efficient scheduling scheme is needed to reduce the energy consumption of an application as well as guarantee the required timing performance. Recent research on various processors [1, 2] shows that the instruction sequence of an application plays an important role in its energy consumption. Thus, new research directions in power optimization have begun to address the issues of instruction-level scheduling for reducing energy consumption [3–5]. In this paper, we analyze and design efficient algorithms for various types of the instruction-level energy-minimization scheduling problem, i.e., given a Directed Acyclic Graph (DAG) that models an application, how to schedule it on a VLIW architecture so that the energy consumption is minimized while the required timing performance is satisfied. (This work is partially supported by the TI University Program, NSF EIA-0103709 and Texas ARP 009741-0028-2001, USA.)
A VLIW processor executes a very long instruction word (called a long instruction word) during each clock cycle. A long instruction word is composed of several parallel sub-instructions. The power consumption to execute a long instruction word during a clock cycle, P_total, can be computed by:

$$P_{total} = P_{base} + \sum_{i} P_{basic}(si_i) + \sum_{i} P_{switch}(si^{cur}_i, si^{last}_i) \quad (1)$$

where P_base is the base power needed to support instruction execution, P_basic(si_i) is the basic power to execute a sub-instruction si_i on a functional unit, and P_switch is the switching power caused by switching activities between si_cur (the current sub-instruction) and si_last (the last sub-instruction) executed on the same functional unit (FU). Let S be a schedule for an application and L the schedule length of S. Then the energy E_S for Schedule S can be computed
by

$$E_S = L \cdot P_{base} + \sum P_{basic}(si) + \sum P_{switch}$$

The term Σ P_basic(si) is the summation of the basic power consumptions of all sub-instructions of an application; it does not change with different schedules. L and Σ P_switch, however, do change with different schedules. Therefore, in order to minimize the energy consumption of an application, schedule length and switching activity both need to be considered in scheduling.
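Under this model, the energy of a schedule separates into a length-dependent term, a schedule-independent basic term, and a switching term. A minimal sketch (symbol names are ours, assuming a unit clock period):

```python
def schedule_energy(length, p_base, basic_powers, switch_powers):
    """E_S = L * P_base + sum(P_basic) + sum(P_switch), assuming a unit
    clock period. Only `length` and `switch_powers` depend on the schedule."""
    return length * p_base + sum(basic_powers) + sum(switch_powers)

# Two hypothetical schedules of the same program: the basic term is identical,
# so the comparison is decided by schedule length and switching activity.
e1 = schedule_energy(5, 2.0, [1.0] * 8, [3.0, 4.0])   # shorter, more switching
e2 = schedule_energy(6, 2.0, [1.0] * 8, [1.0, 1.0])   # longer, less switching
```

Here the shorter schedule loses because its extra switching outweighs the saved base-power cycle; with a larger `p_base` the ranking would flip, which is exactly the trade-off the paper targets.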
A great deal of research effort has been put into this field. In [6], compiler techniques are proposed to reduce power variations, hot spots and peak power in scheduling. Though these techniques can efficiently reduce power variations, they do not aim to minimize the total energy consumption. The low-power resource allocation approach [3, 7, 8] attempts to find an allocation for a fixed schedule in such a way that the total switching activity can be reduced. While this approach is effectively applied to resource allocation in high-level synthesis, it may give inferior results for the instruction-level energy-minimization scheduling problem because scheduling and allocation are performed separately. In [5], a two-phase scheduling approach is proposed to optimize transition activity on the instruction bus of a VLIW architecture. Based on an initial schedule, this approach performs low-power resource allocation (called horizontal scheduling) in the first phase and vertically schedules instructions in the second phase to further reduce the switching activity. Since the schedule length is fixed a priori, this approach cannot be directly applied to the instruction-level energy-minimization scheduling problem. In [4], several revised list scheduling techniques are proposed to minimize energy based on instruction-level energy models for specific processors. Using similar energy models, [9] presents several energy-oriented instruction scheduling approaches and compares them with performance-oriented scheduling. However, these techniques are not general enough to be applied to VLIW processors because of the specific architectures they target.
Several techniques [10–12] have been proposed to minimize switching activity only. In [10], an instruction scheduling technique called cold scheduling is proposed to reduce the switching activity on the control path. In [11], an operand-sharing scheduling technique is proposed to schedule operation nodes with the same operands as closely as possible to reduce the switching activity on the functional units. In [12], a scheduling algorithm for optimizing the coefficients of an FIR filter is proposed to minimize the switching activity on the memory data bus and functional units. Since schedule length is not considered, these techniques may give inferior results for energy minimization.
In recent works [13, 14], the power-efficient scheduling problem is formulated as the Traveling Salesman Problem (TSP) and solved by TSP heuristics when there is one FU. However, formulating a problem as TSP does not necessarily mean that the problem is NP-complete. For example, the problem of sequencing jobs that require common resources on a single machine [15] can be transferred to TSP but is still polynomially solvable. In the literature, it was still unknown whether the instruction-level scheduling problem with minimum switching activities (the minimum-switching-activities scheduling problem) is solvable in polynomial time or is NP-complete.
In our work, we formally prove that the minimum-switching-activities scheduling problem is NP-complete regardless of whether there are resource constraints. Based on this result, we further prove that the instruction-level energy-minimization problem is NP-complete with or without resource constraints. While the minimum-latency scheduling problem is polynomial-time solvable if there is only one FU or there are no resource constraints, the problem becomes NP-complete when switching activity is considered as a second constraint.
To solve the instruction-level energy-minimization problem on VLIW architectures, three algorithms are proposed. We first propose two algorithms, MSAS and MLMSA, to solve two special cases of the energy-minimization problem. The MSAS algorithm is designed for the case when switching activity plays the most important role in total energy consumption; MLMSA is for the case when schedule length plays the most important role. In both MSAS and MLMSA, we consider all nodes in the ready list in each scheduling step and use weighted bipartite matching to do scheduling and allocation simultaneously. Then an algorithm, the Energy Minimization Scheduling Algorithm (EMSA), is proposed to solve the general instruction-level energy-minimization scheduling problem. In EMSA, we start from an initial schedule and reschedule the nodes to reduce switching activity while relaxing the initial schedule toward a given timing constraint; the best schedule is selected from all schedules generated up to that point. The experimental results show significant reductions in energy consumption. When switching activity plays the most important role in total energy consumption, MSAS gives a reduction of 41.2% in switching activity over standard list scheduling on average. When schedule length plays the most important role, MLMSA keeps the same schedule length as list scheduling and reduces switching activity by 33.2% on average. For the general case, EMSA gives a 31.7% reduction in energy compared with list scheduling on average.
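For reference, the standard list scheduling used as the baseline in these comparisons can be sketched as follows (a generic resource-constrained variant with a trivial priority function; the data structures are our own simplification):

```python
def list_schedule(nodes, preds, num_fus):
    """Greedy list scheduling: at each step, fill up to `num_fus` slots with
    ready nodes (nodes whose predecessors were all scheduled in earlier
    steps). Returns the schedule as a list of steps; len() of the result
    is the schedule length."""
    scheduled = set()
    schedule = []
    while len(scheduled) < len(nodes):
        ready = [n for n in nodes
                 if n not in scheduled and preds.get(n, set()) <= scheduled]
        step = ready[:num_fus]  # a real list scheduler would rank `ready` by priority
        schedule.append(step)
        scheduled.update(step)
    return schedule

# Diamond DAG a -> {b, c} -> d on 2 functional units:
sched = list_schedule(["a", "b", "c", "d"],
                      {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}, 2)
```

MSAS and MLMSA replace the simple slot-filling step with a weighted bipartite matching between ready nodes and functional units, so placement (and hence switching cost) is decided together with the schedule.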
The remainder of the paper is organized as follows. Section 2 introduces basic concepts and the energy model. Section 3 proves that the instruction-level energy-minimization scheduling problem is NP-complete. The algorithms are discussed in Section 4. Experimental results and concluding remarks are provided in Sections 5 and 6, respectively.
2 Basic Concepts and Models
In this section, we introduce the basic concepts that will be used in later sections. A motivational example and the energy model are also introduced.
2.1 DAG and Bit_String(u)

A Directed Acyclic Graph (DAG), G = ⟨V, E⟩, is a node-weighted graph, where V is the set of nodes, each node representing a sub-instruction, and E is the edge set, in which an edge between two nodes denotes a dependency relation. In this paper, the operation time of each node in the DAG is assumed to be one time unit.

Assembly code generated by a compiler has to satisfy the dependency relations in the DAG. It consists of long instruction words, and a long instruction word contains multiple sub-instructions. Each sub-instruction specifies on which functional unit it is going to be executed. The bit switches between consecutive sub-instructions executed on the same functional unit cause the switching power consumption (P_switch in equation (1)) in the power model introduced in Section 1 (a detailed discussion of the power model is given in Section 2.2). Therefore, we can capture the switching power consumption (P_switch) through the computation of switching activities.

We define a function, Bit_String(u), to count the number of bit switches for each node u in the DAG G. Bit_String(u) is a binary string that denotes the state of the signals after u is executed. Without loss of generality, it contains one or more of the following items:
• opcodes. When loading instructions from program memory, the state of the instruction bus may be changed, which causes switching power consumption. Based on a series of experiments we performed on a TMS320VC5410 DSP processor, we collected the core current when executing loops of different instructions and instruction combinations, using a method similar to [1]. The experimental results show that the core current increases correspondingly with the Hamming distance of consecutive instructions. Some experimental results are shown in Figure 1. This confirms the correctness of our power model.
• addresses. To access operands, the state of the data address bus may be changed, causing switching power consumption.
• operands. When the inputs are loaded from data memory and processed on functional units, the states of the data bus, the control paths in the functional units, etc., may be changed and cause switching power consumption. Since applications in embedded systems are very specific, we can predict the possible values or patterns of the inputs a priori. Note that an operand here is a possible state, not a real input.
• Don't Care. Don't Care means keeping the previous state. For example, if we execute an add instruction whose inputs come from two registers and whose output is put into a register, then the states of the data bus and data address bus are not affected. Thus, these states should be kept.
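The bookkeeping implied by the items above — in particular the Don't Care convention of keeping the previous bus state — can be sketched as follows (encoding Don't Care as None is our own choice, not the paper's notation):

```python
def apply_bit_string(prev_state, bit_string):
    """Apply a node's Bit_String to the previous bus state.
    Each element is '0', '1', or None (Don't Care = keep previous state).
    Returns (new_state, number_of_transitions)."""
    new_state = []
    transitions = 0
    for old, new in zip(prev_state, bit_string):
        if new is None:            # Don't Care: previous state is kept
            new_state.append(old)
        else:
            if new != old:
                transitions += 1   # only actual bit flips cost switching power
            new_state.append(new)
    return "".join(new_state), transitions

# A register-only add leaves data/address bus bits untouched (None positions):
state, t = apply_bit_string("0110", ["1", None, "0", None])
```

Summing `t` over all consecutive sub-instructions on each functional unit yields the switching activity that P_switch charges for.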
Figure 1. Some experimental results for the TI TMS320C5410 DSP processor: core currents between 42.5 mA and 48.2 mA measured for loops of different instruction combinations (MPY/ADD/SUB sequences).
A motivational example is shown in Figures 2–4. To schedule a Finite Impulse Response (FIR) filter on an experimental VLIW processor, we show how Bit_String() can be used to denote the state of the signals and how different schedules can cause different bit switches. The architecture of our experimental VLIW processor is shown in Figure 2(a). It is a simplified model, similar to one cluster of the TI C6000 VLIW processors. It uses a Harvard architecture with separate memory spaces for program and data. For illustration purposes, we assume that the program memory has a 30-bit instruction bus and the data memory has a 4-bit address bus and data bus. Thus, in each clock cycle, a 30-bit long instruction word composed of 3 sub-instructions is loaded into the decoder, and the sub-instructions are redirected to the corresponding functional units to be executed after decoding. There are 3 functional units (.L, .A and .M in Figure 2(a)): one can perform load/store or addition operations, another multiplication or addition operations, and the third only addition operations. According to the specifications of the functional units, a long instruction word has three possible combinations. One is to contain one load/store instruction, one multiplication instruction, and one addition instruction. Another is to contain two addition instructions and one load/store or multiplication instruction. The third is to contain 3 addition instructions. The instructions and the corresponding opcodes are shown in Figure 2(b). We assume the operation time of each instruction is one time unit. Direct addressing is used by load/store instructions. Multiplication (mpy) and addition (add) operations are performed on the general purpose registers A0–A7. Their inputs come from two registers and the result is stored into the second register. For example, mpy A0, A7 will store the result into A7.

A FIR filter is shown in Figure 3(a). Figure 3(b) shows the addresses and values of its coefficients (b0–b3) and its first 4 inputs (x[0]–x[3]). The assembly code, the corresponding opcodes and node names are shown in Figure 3(c), in which the coefficients and inputs are first loaded into A0–A7 and then computed. The DAG representation of the assembly program is shown in Figure 3(d). Since bus capacitances are usually several orders of magnitude higher than those of the internal nodes of a circuit [16], a considerable amount of power can be saved by reducing the switching activity on the buses. Therefore, we will only consider reducing the transitions on the instruction bus, data address bus and data bus in this example. For each node u, Bit_String(u) denotes the states of the three buses after executing node u. It consists of three parts: the opcode on the instruction bus, the operand on the data bus, and the access address on the data address bus, as shown in Figure 4(a). Usually we do not know the inputs at compile time. However, we can predict their possible values or patterns for a specific application. In this example, we assume our prediction of the operand part in Bit_String(u) is totally correct. Since multiplication, addition and nop operations cannot influence the state of the data memory, we put "—" under the columns "data addr." and "data bus" to denote the "Don't Care" state. Two different schedules are shown in Figures 4(b) and (c), respectively. Schedule 1 is obtained by traditional list scheduling and Schedule 2 is obtained by our MSAS algorithm. Assume the initial states of the buses are all 0's. Then the total number of bit switches over all buses is the total number of transitions for a schedule. The number of transitions for Schedule 1 is 121 and that for Schedule 2 is 66. Since they have the same schedule length, Schedule 2 consumes less energy than Schedule 1 when executed on our experimental VLIW processor. The generated code with long instruction words can be found in the instruction bus columns of Figures 4(b) and (c).
2.2 Power Cost Function and Energy Model
Our energy model is based on the methodology of the instruction-level energy estimation framework for VLIW architectures proposed in [17–20]. In VLIW processors such as the TI TMS320C620x and TMS320C670x DSP devices, there is no data cache and application programs are loaded into on-chip memory before they are executed [21]; therefore the influence of cache misses is not considered in our energy model.
As shown in equation (1), the power consumption to execute a long instruction word during a clock cycle, P_total, can be computed by:

$$P_{total} = P_{base} + \sum_{i} P_{basic}(si_i) + \sum_{i} P_{switch}(si^{cur}_i, si^{last}_i)$$

In VLIW processors, P_base denotes the power to support basic pipeline execution even when a long instruction word contains only NOPs. P_basic(si_i) denotes the basic power for a functional unit to execute a sub-instruction on a specific datapath even when the inputs do not cause any transitions. P_switch denotes the switching power caused by switching activities between si_cur (the current sub-instruction) and si_last (the last sub-instruction) executed on the same FU.
The switching power is proportional to the number of transitions. So

$$P_{switch} = P_{tr} \cdot WHD(Bit\_String(si^{cur}), Bit\_String(si^{last})) \quad (2)$$

where P_tr is a power coefficient representing the consumed power per transition, and WHD (Weighted Hamming Distance) is a function used to compute the number of transitions between si_cur and si_last. Let a = Bit_String(si_cur) and b = Bit_String(si_last); we define WHD(a, b) as:

$$WHD(a, b) = \sum_{i} w_i \, (a_i \oplus b_i) \quad (3)$$

where w_i is the weight of a transition; w_i denotes the weight of the power consumption caused by one transition on different units.
The energy consumption of an application is the summation of its power consumption over all clock cycles. Let S be a schedule for an application and L the schedule length of S. Then the energy consumption of Schedule S, E_S, can be computed by

$$E_S = L \cdot P_{base} + \sum P_{basic}(si) + \sum P_{switch}$$

The term Σ P_basic(si) is the summation of the basic power consumptions of all sub-instructions of an application and does not change with different schedules. P_base is a constant for a specific VLIW processor, so L · P_base varies with the schedule length. Σ P_switch is the switching power and changes with different schedules. Therefore, schedule length and switching activity need to be considered together in instruction-level scheduling techniques in order to minimize the energy consumption of an application. One example is shown in Figure 5.
In Figure 5(a), a DAG is given, in which the binary string next to each node is the value of Bit_String(u). Given 2 functional units, FU1 and FU2, three different schedules are shown in Figures 5(b)-(d). An empty slot denotes a NOP node. We assume a NOP node does not cause any transition. Schedule 1 and Schedule 2 have the same schedule length but Schedule 2 has fewer bit switches; thus we can easily see that Schedule 2 has lower energy consumption than Schedule 1. To compare the energy consumptions of Schedule 2 and Schedule 3, we need to consider the schedule length. Assume that the power coefficient P_tr in equation (2) is 1 and w_i in equation (3) is 1, and ignore the term Σ P_basic common to all schedules. With P_base = 1, the energy for Schedule 3 is 6 · 1 + 4 = 10 and that for Schedule 2 is 5 · 1 + 8 = 13, so Schedule 3 uses less energy than Schedule 2. However, when P_base = 5, the energy for Schedule 3 is
Figure 2. (a) A VLIW processor (program controller, decoder, register file A0–A7, and functional units .L, .A and .M; a 30-bit long instruction word holds three sub-instructions). (b) The instruction set, with 10-bit opcodes for ld, mpy, add .M, add .A, add .L and nop.
Figure 3. (a) A FIR filter: Y[n] = b0·X[n] + b1·X[n−1] + b2·X[n−2] + b3·X[n−3] (b) The coefficients and inputs in data memory (c) The assembly program, opcodes and node ids (d) The DAG representation
Figure 4. (a) Bit_String() (b) Schedule 1 (c) Schedule 2
Figure 5. (a) A given DAG (b) Schedule 1 with total bit switches 15 (c) Schedule 2 with total bit switches 8 (d) Schedule 3 with total bit switches 4
6 · 5 + 4 = 34 (schedule length 6, 4 bit switches) and that for Schedule 2 is 5 · 5 + 8 = 33. Then Schedule 2 uses less energy. Thus, we have to consider both schedule length and switching activity when comparing the energy consumption of schedules.
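Dropping the Σ P_basic term common to both schedules and taking P_tr = w_i = 1, the comparison reduces to E = L · P_base + (bit switches). A quick numeric check, assuming schedule lengths of 5 and 6 and the switch counts 8 and 4 from Figure 5:

```python
def energy(length, switches, p_base):
    """E = L * P_base + total bit switches, with P_tr = w_i = 1 and the
    schedule-independent sum of basic powers omitted."""
    return length * p_base + switches

# Schedule 2: length 5, 8 bit switches; Schedule 3: length 6, 4 bit switches.
e2_low, e3_low = energy(5, 8, 1), energy(6, 4, 1)     # Schedule 3 wins
e2_high, e3_high = energy(5, 8, 5), energy(6, 4, 5)   # Schedule 2 wins
```

The crossover as P_base grows is what makes neither "shortest schedule" nor "fewest switches" correct on its own.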
2.3 Weighted Bipartite Matching
In our algorithms, we use weighted bipartite matching to help find the binding between nodes in a DAG and resources. A Weighted Bipartite Graph G = ⟨V, E, W⟩ is an edge-weighted undirected graph, where V can be partitioned into two sets V1 and V2 such that each edge (u, v) ∈ E implies either u ∈ V1 and v ∈ V2 or u ∈ V2 and v ∈ V1, and W(e) represents the weight of edge e ∈ E.

Given a bipartite graph G = ⟨V, E, W⟩, a matching M for G is a subset of E such that for each node v ∈ V, at most one edge of M is incident on v. The weight of a matching is the summation of the weights of all edges in M. A maximum matching is a matching of maximum cardinality, i.e., a matching M such that for any matching M′ we have |M| ≥ |M′|, where |M| denotes the cardinality of M. A min-cost maximum matching is a maximum matching with minimum weight.
2.4 The Optimization Problem
Our optimization problem is defined as follows: given a DAG G = ⟨V, E⟩, a function Bit_String(u) for each node u ∈ V, and K functional units, find a schedule S of G that has minimum energy consumption based on the energy model in Section 2.2.
3 NP-Completeness
In this section, we prove that our optimization problem is NP-complete.
From the energy model in equation (2), we know the energy consumption of a schedule is related to schedule length and switching activity. To prove our optimization problem NP-complete, we use the following method. We first consider two subproblems of our optimization problem: the minimum-latency scheduling problem, which only minimizes schedule length, and the minimum-switching-activities scheduling problem, which only minimizes switching activity. If we can prove one of these two problems NP-complete, we can further prove that our optimization problem is NP-complete. It is well known that the minimum-latency scheduling problem is polynomial-time solvable when there is only one functional unit or there are no resource constraints. Therefore, the focus is put on the minimum-switching-activities scheduling problem. As discussed in Section 1, it was still unknown in the literature whether the minimum-switching-activities scheduling problem is solvable in polynomial time or is NP-complete. Thus, we categorize the problem into the two cases shown in Theorem 3.1 and Theorem 3.2 and prove them as follows.
Theorem 3.1. Let K be the number of resources. When K = 1, the minimum-switching-activities scheduling problem is NP-complete.
In order to prove Theorem 3.1, we first define two decision problems: one is the decision problem (DP1) of the minimum-switching-activities scheduling problem when K = 1, and the other is the rectilinear Geometric Traveling Salesman Problem (GTSP). We reduce GTSP to DP1 in the proof.

DP1: Given a DAG G = (V, E), a function Bit String(v) for each node v in V, one resource, and one constant C, does there exist a schedule that has switching activities at most C?

The rectilinear geometric traveling salesman problem (GTSP) [22]: Given a set S of integer-coordinate points in the plane and a constant B, does there exist a circuit passing through all the points of S which, with edge length measured by the rectilinear metric, has total length less than or equal to B?
Proof. It is obvious that DP1 belongs to NP. Assume that S = {(x1, y1), (x2, y2), ..., (xn, yn)} is an instance of GTSP. Construct a DAG G = (V, E) as follows: V = {v1, v2, ..., vn}, where vi corresponds to the point (xi, yi) in S, and E is empty. Assume that X = max(xi) and Y = max(yi) for 1 <= i <= n; then Bit String(vi) = 0^(X-xi) 1^(xi) . 0^(Y-yi) 1^(yi) for each vi (1 <= i <= n), where "." denotes concatenation. For example, if X = 3, Y = 3, and vi = (2, 1), then Bit String(vi) = 011 001. Under this encoding, the Hamming distance between two bit strings equals the rectilinear distance between the corresponding points. Set C = B. Since GTSP is NP-complete and the reduction can be done in polynomial time, DP1 is NP-complete.
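The reduction hinges on encoding each point so that the Hamming distance between two bit strings equals the rectilinear (L1) distance between the corresponding points. The following sketch uses a unary-style encoding; this is our reconstruction of the partially garbled construction, chosen because it reproduces the paper's example of the point (2, 1) mapping to 011 001 when X = Y = 3.

```python
def bit_string(x, y, X, Y):
    """Unary-style encoding of an integer point (x, y) with bounds
    X >= x and Y >= y: zeros padding followed by x ones, then zeros
    padding followed by y ones."""
    return "0" * (X - x) + "1" * x + "0" * (Y - y) + "1" * y

def hamming(a, b):
    """Number of differing bit positions between equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

p, q = (2, 1), (3, 3)
d_ham = hamming(bit_string(*p, 3, 3), bit_string(*q, 3, 3))
d_l1 = abs(p[0] - q[0]) + abs(p[1] - q[1])
# d_ham == d_l1 == 3
```

Because each coordinate is encoded in unary, two encodings of the same coordinate axis differ in exactly |x1 - x2| positions, so the total Hamming distance is the rectilinear distance.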
Theorem 3.2. Let K be the number of resources. When K > 1, the minimum-switching-activities scheduling problem is NP-complete.

The decision problem (DP2) of the minimum-switching-activities scheduling problem when K > 1 is similar to DP1 except that the number of resources is greater than 1 in DP2. The proof of Theorem 3.2 is as follows.
Proof. It is obvious that DP2 belongs to NP. Assume that S = {(x1, y1), (x2, y2), ..., (xn, yn)} is an instance of GTSP. Construct a DAG G = (V, E) as follows: V = {v1, v2, ..., vn}, where vi corresponds to the point (xi, yi) in S, and E is empty. Assume that X = max(xi) and Y = max(yi) for 1 <= i <= n; then Bit String(vi) = 1^L . 0^(X-xi) 1^(xi) . 0^(Y-yi) 1^(yi) for each vi (1 <= i <= n), where the prefix 1^L is a sufficiently long block of ones. For example, if X = 3, Y = 3, and vi = (2, 1), then Bit String(vi) = 11111111 011 001. Set the bound C as in the proof of Theorem 3.1, adjusted for the cost of the first transition from the all-0 state, and set the initial Bit String of each resource to all 0's.

The prefix of ones in the construction makes all nodes in V be assigned to the same resource in any schedule that minimizes switching activities, since moving to another resource whose state is still all 0's would incur at least L extra bit switches. Since the reduction can be done in polynomial time, DP2 is NP-complete.
From Theorems 3.1 and 3.2, we know that the minimum-switching-activities scheduling problem is NP-complete. We then use this result to prove that our optimization problem is NP-complete.
Theorem 3.3. Our optimization problem for instruction-level energy-minimization scheduling is NP-complete.
Proof. Given an instance of the minimum-switching-activities scheduling problem, it can be directly mapped to an instance of our optimization problem by setting the schedule-length coefficient in the energy model to 0. Since the reduction can be done in polynomial time, our optimization problem is NP-complete.
4 The Algorithms
In this section, we present three algorithms to solve the instruction-level energy-minimization scheduling problem. Two algorithms for two special cases are presented first, and then another algorithm to solve the general problem is proposed.
4.1 MSAS Algorithm
4.1.1 The Algorithm
Our first algorithm, Minimizing Switching Activities Scheduling (MSAS), is designed to solve a special case of the instruction-level energy-minimization scheduling problem, i.e., the case in which the switching activities play the most important role in energy consumption.

When the schedule-length coefficient is very small compared with the switching-energy coefficient (in equation 2), the energy of a schedule depends mainly on switching activities. For example, when the coefficient equals 0.1 in Figure 5, we would need to reduce the schedule length by 10 control steps to offset a single bit switch. Thus, we need an algorithm that minimizes switching activities as much as possible. On the other hand, considering performance, we also want to minimize the schedule length. Hence, the MSAS algorithm minimizes switching activities as its first priority while still considering schedule length. Since most previous works focus on a single functional unit, we need algorithms that can take advantage of multiple functional units under VLIW architectures. The MSAS algorithm is shown in Algorithm 4.1.
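The trade-off can be illustrated with a toy linear energy model of the form E = c_len * SL + c_sw * SA. The exact form and coefficients of equation 2 are not reproduced here, so treat this as a sketch rather than the paper's model; the example schedule numbers are of the magnitude seen in Table 1.

```python
def energy(sl, sa, c_len, c_sw=1.0):
    """Toy linear energy model: schedule length SL weighted by c_len,
    switching activity SA weighted by c_sw (a sketch of equation 2)."""
    return c_len * sl + c_sw * sa

sched_a = (7, 54)   # (schedule length, bit switches), e.g. list scheduling
sched_b = (8, 27)   # one step longer, half the switches, e.g. MSAS

# When switching dominates (c_len = 0.1), the low-switching schedule wins:
assert energy(*sched_b, 0.1) < energy(*sched_a, 0.1)
# When length dominates (c_len = 100), the shorter schedule wins:
assert energy(*sched_a, 100) < energy(*sched_b, 100)
```

This is why the paper needs different algorithms for the two regimes: which schedule is better depends entirely on the coefficient ratio.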
Algorithm 4.1 MSAS (Min-Switching-Activities Scheduling)

Input: DAG G = (V, E), Bit String(v) for each v in V, functional unit set FU = {FU_1, ..., FU_K}
Output: A schedule with switching activities minimized

ReadyList <- all ready nodes in V;
while ReadyList is not empty do
    Construct BG = (V1 ∪ V2, E', W), where:
        V1 = FU and V2 = ReadyList;
        E' = {(FU_j, v) | FU_j in FU, v in ReadyList};
        W(FU_j, v) = f_W(BitState(v), BitState(FU_j));
    M <- MinCostMaxBipartiteMatching(BG);
    for all (FU_j, v) in M do
        Schedule node v to functional unit FU_j;
        BitState(FU_j) <- BitState(v);
        Remove v from ReadyList;
        Check all adjacent nodes of v and put new ready nodes into ReadyList;
    end for
end while
Due to the dependencies in a DAG, we can only schedule a node after all its parent nodes have been scheduled. The scheduling problem with switching activities minimization is how to find a matching between functional units and ready nodes such that the schedule based on this matching minimizes the total switching activities in every scheduling step. This is equivalent to the min-cost weighted bipartite matching problem. Thus, in the MSAS algorithm, we repeatedly create a weighted bipartite graph BG between the set of functional units and the set of nodes in the Ready List, and assign nodes based on the min-cost maximum bipartite matching M. In each scheduling step, the weighted bipartite graph BG = (V1 ∪ V2, E', W) is constructed as follows: V1 = FU and V2 = ReadyList, where FU = {FU_1, ..., FU_K} is the set of all functional units and ReadyList is the set of all nodes in the Ready List; for each FU_j in FU and each node v in ReadyList, an edge (FU_j, v) is added into E' and W(FU_j, v) = f_W(BitState(v), BitState(FU_j)), where f_W(., .) is the weighted Hamming distance function defined in equation 3 and BitState(FU_j) = BitState(u), where u is the last node executed on FU_j.

Since bit switches are set as the weight of each edge, a schedule based on this matching optimally minimizes the switching activities at every scheduling step. All nodes in M are scheduled at every scheduling step, so the schedule length cannot grow too large. After scheduling a node v to a FU_j, we update the state of signals on FU_j by setting BitState(FU_j) <- BitState(v). Note that there may exist "Don't Care" bits in BitState(v). In such a case, we need to substitute the current state for the "Don't Care" bits in order to correctly count bit switches on FU_j. An example is shown in Figure 6.
Given the DAG in Figure 5(a), the scheduling of the first step by the MSAS algorithm when there are 2 FUs is shown in Figure 6. The Ready List of the DAG in the first step is shown in Figure 6(a). Assume that the initial states of the FUs are 000 and that the weights in the weighted Hamming distance function f_W are 1. A weighted bipartite graph based on the set of FUs and the set of nodes in the Ready List is constructed in Figure 6(b). A min-cost maximum bipartite matching is shown in Figure 6(c). The schedule of the first step is shown in Figure 6(d). After scheduling Node 1 to FU1 and Node 2 to FU2, we change the states of the FUs. The final schedule generated by MSAS is shown in Figure 5(c). Compared with Schedule 1 in Figure 5(b), generated by traditional list scheduling, Schedule 2 has fewer bit switches.
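A first scheduling step like the one in Figure 6 can be sketched as follows. This simplified version assumes exactly as many ready nodes as FUs and uses plain (unweighted) Hamming distance with a brute-force matching in place of the Hungarian Method:

```python
from itertools import permutations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def msas_step(fu_states, ready_bits):
    """One MSAS scheduling step (a simplified sketch): assign the ready
    nodes to functional units so that the total Hamming distance between
    each FU's current bit state and its assigned node's bit string is
    minimal. Returns the matching, its total bit switches, and the
    updated FU states."""
    k = len(fu_states)
    switches, assign = min(
        (sum(hamming(fu_states[j], ready_bits[p[j]]) for j in range(k)), p)
        for p in permutations(range(k))
    )
    matching = [(j, assign[j]) for j in range(k)]
    # Each FU's state becomes the bit string of the node it executes.
    new_states = [ready_bits[assign[j]] for j in range(k)]
    return matching, switches, new_states

# Figure 6-style first step: two FUs initialized to "000",
# ready nodes with bit strings "001" (node 1) and "010" (node 2).
matching, switches, states = msas_step(["000", "000"], ["001", "010"])
# Each assignment flips one bit, so the matching weight is 2.
```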
It is known that finding a min-cost maximum bipartite matching on a graph with n nodes takes O(n^3) time by the Hungarian Method [23]. Let K be the number of functional units. In every scheduling step, we need at most O((|V| + K)^3) time to find a min-cost maximum bipartite matching using the Hungarian Method, and there are at most |V| scheduling steps. Thus, the complexity of MSAS is O(|V| (|V| + K)^3).

4.1.2 Applying the MSAS Algorithm to VLIW Architectures
Our generic algorithm can be easily modified to solve the scheduling problem on various VLIW architectures. In this section, we show how to apply MSAS to a VLIW architecture similar to the TI C6000 VLIW processor. In such an architecture, a sub-instruction can be put into any location in the instruction bus; it is the decoder that distributes each sub-instruction to a FU. So we need to change our generic algorithm a little to handle this case. The VLIW architecture consists of heterogeneous FUs with resource constraints. Thus, the number of sub-instructions of a given type executing in a clock cycle cannot exceed the resource bound. For example, since there is only one load/store unit in the experimental VLIW processor in Figure 2, we cannot schedule more than one load/store instruction in the same schedule step. Assume that at most R sub-instructions of a type can be executed in a clock cycle. If there are more than R sub-instructions of this type in the bipartite matching in a schedule step of MSAS, we sort the edges carrying this type of sub-instruction in increasing order of weight and schedule the sub-instructions of the first R edges. Then we remove all sub-instructions of this type from the Ready List and find the best matching for the remaining FUs and remaining nodes, repeating until every FU has been assigned a node. In this way, we reduce switching activity as much as possible. We also need to add NOP nodes if the number of nodes is less than the number of FUs in a step. Using S_i to denote sub-instruction i in the instruction bus and applying the above, Schedule 2 in Figure 4(c) is generated by MSAS. It reduces the bit switches from 121 in Schedule 1 to 66.
4.2 MLMSA Algorithm
Our second algorithm, Minimizing Latency with Minimal Switching Activities Scheduling (MLMSA), is designed to solve another special case of the instruction-level energy-minimization scheduling problem, i.e., the case in which the schedule-length coefficient in equation 2 is very large compared with the switching energy caused per transition. In this case, the schedule length plays the most important role in energy consumption. Thus, MLMSA is designed to generate a schedule with minimum schedule length as its first priority and then reduce switching activities as much as possible.

MLMSA is similar to MSAS except that we use a different method to assign weights to the edges of the weighted bipartite graph. In order to reduce the schedule length, we define Depth(v), a priority function for each node v in a DAG, to be the length of the longest path from v to a leaf node. When constructing the weighted bipartite graph BG, the weight of each edge (FU_j, v) is set as W(FU_j, v) = f_W(BitState(v), BitState(FU_j)) - Depth(v) * C_max * BLen, where Depth(v) is the priority value of node v as mentioned above, BLen is the bit length of Bit String(v), and C_max is the maximum of the weights c_i in equation 3.

Using this method, the priority of a node dominates the edge weight in the bipartite graph BG. Thus, the node with the highest priority in the Ready List will be scheduled first. However, among nodes with the same priority, switching-activity minimization is considered. Thus, we minimize the schedule length while still reducing switching activities.
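The effect of the priority term can be sketched as follows. The weight formula here is our reconstruction of the garbled original: `depth` and `c_max` stand in for Depth(v) and the maximum weight coefficient of equation 3, and the point is only that the priority term dominates the Hamming term.

```python
def depth(node, children):
    """Longest path from node to a leaf in the DAG (the MLMSA priority)."""
    kids = children.get(node, [])
    return 0 if not kids else 1 + max(depth(k, children) for k in kids)

def mlmsa_weight(fu_state, node_bits, node_depth, c_max):
    """Edge-weight sketch: subtracting depth * c_max * bit-length makes
    node priority dominate in a min-cost matching; Hamming distance only
    breaks ties among nodes of equal depth."""
    ham = sum(a != b for a, b in zip(fu_state, node_bits))
    return ham - node_depth * c_max * len(node_bits)

children = {"a": ["b"], "b": ["c"], "c": []}   # chain a -> b -> c
w_deep = mlmsa_weight("000", "111", depth("a", children), c_max=1)
w_shallow = mlmsa_weight("000", "000", depth("c", children), c_max=1)
# The deeper node gets the smaller (more attractive) weight even though
# it costs 3 bit switches while the shallow node costs 0:
assert w_deep < w_shallow
```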
4.3 EMSA Algorithm
In this section, an algorithm, the Energy Minimization Scheduling Algorithm (EMSA), is proposed to solve the general instruction-level energy-minimization scheduling problem. The basic idea is to reschedule nodes to reduce switching activities while relaxing the schedule length of an initial schedule up to a given timing constraint, and then select
[Figure 6 residue: (a) the Ready List of nodes 1-4 with their bit strings, e.g. node 1 = (001) and node 2 = (010); (b) the weighted bipartite graph between FU1, FU2 (initial state 000) and the ready nodes; (c) the min-cost maximum matching M = {(FU1, 1), (FU2, 2)} with weight 2; (d) the first-step schedule FU1 <- 1 (001), FU2 <- 2 (010).]

Figure 6. (a) Ready List (b) Weighted Bipartite Graph (c) Min-cost Max Bipartite Matching (d) The schedule for the first step
the schedule with minimal energy consumption. EMSA is shown in Algorithm 4.2.
Algorithm 4.2 EMSA (Energy Minimization Scheduling Algorithm)

Input: DAG G = (V, E), Bit String(v) for each v in V, functional unit set FU = {FU_1, ..., FU_K}, an initial schedule S_init, a timing constraint TC
Output: A schedule with minimum energy

L <- the schedule length of S_init; i <- 1;
for t <- L to TC do
    S_i <- Relax_Reschedule(G, BitState, FU, S_init, t);
    i <- i + 1;
end for
Select the schedule with minimum energy among S_i (1 <= i <= TC - L + 1);
In EMSA, a function, Relax_Reschedule, is called to generate a new schedule for each t from L to TC, where L is the schedule length of the given initial schedule and TC is the timing constraint. Each schedule generated by Relax_Reschedule has low switching activities because the switching activities are reduced as much as possible inside Relax_Reschedule. For each schedule, we compute its energy consumption based on equation 2. Then the schedule with minimum energy consumption is selected as the output.

The function Relax_Reschedule reschedules nodes to reduce switching activities while relaxing the initial schedule, as shown in Algorithm 4.3. Relax_Reschedule works much like As-Late-As-Possible (ALAP) scheduling. In Relax_Reschedule, we reschedule nodes to new locations from the bottom up, based on an initial schedule and a timing constraint. Since the timing constraint is equal to or greater than the schedule length of the initial schedule, the nodes in the initial schedule may have some freedom to be moved to a new location such
Algorithm 4.3 Relax_Reschedule()

Input: DAG G = (V, E), Bit String(v) for each v in V, functional unit set FU, an initial schedule S_init, a timing constraint TC
Output: A schedule with switching activities minimized

for all v in V do
    LatestStep(v) <- TC;
    RestChild(v) <- the number of child nodes of v;
end for
LeafList <- all nodes with RestChild = 0 in V;
while LeafList is not empty do
    for all v in LeafList do
        CurStep(v) <- current schedule step of v in S_init;
        LocSet(v) <- all available locations that v can be put into, from CurStep(v) to LatestStep(v);
        for all Loc in LocSet(v) do
            SW(Loc) <- the switching activities caused if v is moved to Loc;
        end for
        MinSW(v) <- the minimum value among all SW(Loc);
        BestLoc(v) <- the Loc closest to the bottom with SW(Loc) = MinSW(v);
    end for
    u <- the node with minimum MinSW(v) among all nodes in LeafList;
    Reschedule u to BestLoc(u);
    for all parent nodes w of u in G do
        SchedStep(u) <- the schedule step of u;
        if LatestStep(w) > SchedStep(u) - 1 then
            LatestStep(w) <- SchedStep(u) - 1;
        end if
        RestChild(w) <- RestChild(w) - 1;
    end for
    Remove u from LeafList and put new ready nodes with RestChild = 0 into LeafList;
end while
that the switching activities can be reduced. In order to obey the precedence relations, we associate two variables, LatestStep(v) and RestChild(v), with each node v, where LatestStep(v) keeps track of the latest schedule step to which v may be moved, and RestChild(v) keeps track of the number of child nodes of v that have not yet been rescheduled. We put all nodes with RestChild = 0 into the set LeafList and compute the best location of each node with two considerations: 1) it causes the greatest reduction in switching activity; 2) it is the closest to the bottom. Then we select the node with the greatest reduction in switching activity among all nodes in LeafList and reschedule it to its best location. In this way, we reduce switching activities as much as possible when rescheduling each node.
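The inner step of Relax_Reschedule, evaluating candidate locations for one node and preferring the bottom-most location among those with minimal switching activity, can be sketched for a single functional unit as follows. This is a simplification that ignores precedence constraints and multiple FUs.

```python
def switching(seq):
    """Total bit switches of a single FU executing seq in order."""
    return sum(sum(a != b for a, b in zip(x, y))
               for x, y in zip(seq, seq[1:]))

def move(seq, i, j):
    """Return a copy of seq with element i moved so it ends up at index j."""
    s = list(seq)
    v = s.pop(i)
    s.insert(j, v)
    return s

def best_location(schedule, idx, latest):
    """Among positions idx..latest, find where schedule[idx] should go:
    minimal switching activity, with ties broken toward the bottom
    (largest index), mirroring BestLoc in Relax_Reschedule."""
    candidates = [(switching(move(schedule, idx, j)), j)
                  for j in range(idx, latest + 1)]
    min_sw = min(sw for sw, _ in candidates)
    best_j = max(j for sw, j in candidates if sw == min_sw)
    return best_j, min_sw

# Moving the first "111" step down groups equal bit patterns together:
sched = ["000", "111", "000", "111"]
pos, sw = best_location(sched, 1, 3)
# pos == 3, sw == 3: switching activity drops from 9 to 3.
```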
Since the schedule generated by EMSA is directly related to the initial schedule, it is very important to have a good initial schedule. Both MSAS and MLMSA greatly reduce switching activities, so we can use the schedules generated by them as the initial schedules of EMSA. For example, using Schedule 2 in Figure 5(c) as the initial schedule and assuming a timing constraint of 6 time units, Schedule 2 and Schedule 3 (Figure 5(d)) are obtained by EMSA. The best schedule is selected based on our energy model in equation 2. When both coefficients in equation 2 are set to 1, Schedule 3 is selected as the final schedule.
5 Experiments
We experimented with our algorithms on a set of benchmarks including a 2-cascaded Biquad filter, a 4-stage lattice filter, an 8-stage lattice filter, a differential equation solver, an elliptic filter, and a Voltera filter. When experimenting with EMSA, the schedules generated by MSAS and MLMSA are used as the initial schedules, and the best results are selected. When computing the switching power, we fix the coefficients of equation 2 and the weights c_i of equation 3 to constant values. The results are compared with those of the traditional list scheduling algorithm. In our experiments, the opcodes for each node are obtained from the TI TMS320C6000 instruction set [24].
The experimental results for MSAS and list scheduling are shown in Table 1. Column "SA" lists the switching activities for list scheduling (field "ListS") and MSAS (field "MSAS") as the number of functional units varies from 3 to 6. Column "SL" lists the schedule length. Column "%" under MSAS lists the percentage reduction in switching activity compared with list scheduling. The average reduction is shown in the last row. Similarly, the experimental results for MLMSA and list scheduling are shown in Table 2.
The experimental results in Table 3 and Table 4 show the comparison on energy for EMSA and list scheduling under two different settings of the power coefficients, respectively. Row "E." lists the energy for the various benchmarks. Row "SL" lists the schedule length obtained by list scheduling. Row "TC" lists the timing constraints used in EMSA. Each benchmark is scheduled with three timing constraints; the first one is the schedule length obtained by list scheduling. Row "%" in field "EMSA" lists the reduction in energy of EMSA compared with list scheduling (field "ListS"). In each table, the results are shown when the number of FUs equals 3, 4, and 5, respectively.
From the experimental results in Tables 1-4, we find that our algorithms reduce energy consumption significantly compared with list scheduling. When switching activity plays the most important role in energy consumption, MSAS gives an average reduction of 41.2% in switching activity compared with list scheduling. When schedule length plays the most important role in energy consumption, MLMSA gives the same schedule length as list scheduling while achieving an average reduction of 33.2% in switching activity at the same time. For the general case, EMSA gives an average reduction of 31.2% in energy compared with list scheduling.
Figure 7. The energy variation of the 8-lattice IIR filter (FUs = 4) for EMSA and list scheduling as the timing constraint increases from 17 to 25.
From the experimental results in Table 3 and Table 4, we can see that the energy of the schedules produced by EMSA decreases correspondingly as the timing constraint increases for each benchmark. To show the energy variation further, we show the results for the 8-lattice IIR filter as the timing constraint varies from 17 to 25 in Figure 7. EMSA gradually reduces the energy as the timing constraint increases from 17 to 23. After that, the energy cannot be reduced any more. This shows that no further energy reduction can be gained once the timing constraint reaches a certain point. At this point, with the timing constraint equal to 23, EMSA gives a reduction of 47.8% compared with list scheduling.
6 Conclusion
In this paper, we presented instruction-level scheduling techniques to minimize the energy consumption of applications executed on VLIW processors. We studied the two most important factors, switching activity and sched-
Bench.     FUs=3                FUs=4                FUs=5                FUs=6
           ListS | MSAS         ListS | MSAS         ListS | MSAS         ListS | MSAS
           SL SA | SL SA  %     SL SA | SL SA  %     SL SA | SL SA  %     SL SA | SL SA  %
2-Cas-Biq  7 54  | 8 27  50.0   6 40  | 7 33  17.5   6 45  | 7 28  37.8   6 45  | 7 28  37.8
4-Lat-IIR  9 64  | 12 35 45.3   9 69  | 11 35 49.3   9 68  | 10 37 45.6   9 62  | 10 34 45.2
8-Lat-IIR  17 96 | 23 43 55.2   17 96 | 22 43 55.2   17 102| 20 43 57.8   17 103| 19 45 56.3
DEQ        5 40  | 6 21  47.5   5 30  | 5 27  10.0   5 35  | 5 27  22.9   5 35  | 5 27  22.9
Elliptic   15 143| 16 90 37.1   14 142| 14 93 34.5   14 142| 14 85 40.1   14 142| 14 85 40.1
Voltera    12 60 | 15 31 48.3   12 57 | 14 29 49.1   12 56 | 13 32 42.9   12 57 | 13 34 40.4
Average Reduction %      47.2                 35.9                 41.2                 40.4
Table 1. The comparison on switching activity and schedule length for MSAS and List Scheduling when the number of FUs varies from 3 to 6.
Bench.     FUs=3                FUs=4                FUs=5                FUs=6
           ListS | MLMSA        ListS | MLMSA        ListS | MLMSA        ListS | MLMSA
           SL SA | SL SA  %     SL SA | SL SA  %     SL SA | SL SA  %     SL SA | SL SA  %
2-Cas-Biq  7 54  | 7 28  48.1   6 40  | 6 29  27.5   6 45  | 6 23  48.9   6 45  | 6 30  33.3
4-Lat-IIR  9 64  | 9 46  28.1   9 69  | 9 48  30.4   9 68  | 9 40  41.2   9 62  | 9 38  38.7
8-Lat-IIR  17 96 | 17 71 26.0   17 96 | 17 73 24.0   17 102| 17 57 44.1   17 103| 17 47 54.4
DEQ        5 40  | 5 31  22.5   5 30  | 5 27  10.0   5 35  | 5 27  22.9   5 35  | 5 27  22.9
Elliptic   15 143| 15 101 29.4  14 142| 14 93 34.5   14 142| 14 85 40.1   14 142| 14 85 40.1
Voltera    12 60 | 12 44 26.7   12 57 | 12 41 28.1   12 56 | 12 37 33.9   12 57 | 12 34 40.4
Average Reduction %      30.1                 25.7                 38.5                 38.3
Table 2. The comparison on switching activity and schedule length for MLMSA and List Scheduling when the number of FUs varies from 3 to 6.
           2-Cas-Biq        4-Lat-IIR        8-Lat-IIR        DEQ             Elliptic           Voltera

FUs = 3
ListS SL   7                9                17               5               15                 12
      E.   61               73               113              45              158                72
EMSA  TC   7 8 9            9 11 14          17 20 24         5 6 7           15 18 21           12 14 17
      E.   35 30 28         55 49 41         88 75 66         36 27 26        116 96 89          56 55 47
      %    42.6 50.8 54.1   24.7 32.9 43.8   22.1 33.6 41.6   20.0 40.0 42.2  26.6 39.2 43.7     22.2 23.6 34.7

FUs = 4
ListS SL   6                9                17               5               14                 12
      E.   46               78               113              35              156                69
EMSA  TC   6 7 8            9 11 14          17 20 24         5 6 7           14 18 21           12 14 17
      E.   35 31 29         57 46 39         90 68 59         32 28 28        107 79 74          53 43 43
      %    23.9 32.6 37.0   26.9 41.0 50.0   20.4 39.8 47.8   8.6 20.0 20.0   31.4 49.4 52.6     23.2 37.7 37.7

FUs = 5
ListS SL   6                9                17               5               14                 12
      E.   51               77               119              40              156                68
EMSA  TC   6 7 8            9 11 14          17 20 24         5 6 7           14 18 21           12 14 17
      E.   29 29 29         49 43 43         74 63 58         32 27 27        99 82 82           49 45 44
      %    43.1 43.1 43.1   36.4 44.2 44.2   37.8 47.1 51.3   20.0 32.5 32.5  36.5 47.4 47.4     27.9 33.8 35.3
Table 3. The comparison on energy for EMSA and List Scheduling (first coefficient setting) when the number of FUs varies from 3 to 5.
           2-Cas-Biq        4-Lat-IIR        8-Lat-IIR        DEQ             Elliptic           Voltera

FUs = 3
ListS SL   7                9                17               5               15                 12
      E.   68               82               130              50              173                84
EMSA  TC   7 8 9            9 11 14          17 20 24         5 6 7           15 18 21           12 14 17
      E.   42 38 37         64 59 54         105 94 89        41 33 32        131 114 108        68 68 61
      %    38.2 44.1 45.6   22.0 28.0 34.1   19.2 27.7 31.5   18.0 34.0 36.0  24.3 34.1 37.6     19.0 19.0 27.4

FUs = 4
ListS SL   6                9                17               5               14                 12
      E.   52               87               130              40              170                81
EMSA  TC   6 7 8            9 11 14          17 20 24         5 6 7           14 18 21           12 14 17
      E.   41 38 37         66 57 51         107 90 82        37 34 34        121 95 92          65 57 57
      %    21.2 26.9 28.8   24.1 34.5 41.4   17.7 30.8 36.9   7.5 15.0 15.0   28.8 44.1 45.9     19.8 29.6 29.6

FUs = 5
ListS SL   6                9                17               5               14                 12
      E.   57               86               136              45              170                80
EMSA  TC   6 7 8            9 11 14          17 20 24         5 6 7           14 18 21           12 14 17
      E.   35 35 35         58 53 53         91 81 79         37 33 33        113 98 98          61 59 59
      %    38.6 38.6 38.6   32.6 38.4 38.4   33.1 40.4 41.9   17.8 26.7 26.7  33.5 42.4 42.4     23.8 26.3 26.3
Table 4. The comparison on energy for EMSA and List Scheduling (second coefficient setting) when the number of FUs varies from 3 to 5.
ule length, that influence the energy consumption of an application on a VLIW processor during scheduling. We proved that the instruction-level energy-minimization scheduling problem is NP-complete with or without resource constraints. We proposed three heuristic algorithms, MSAS, MLMSA, and EMSA, to solve the problem. While switching activity and schedule length are given higher priority in MSAS and MLMSA, respectively, EMSA gives the best result by considering both of them. The experimental results show that our algorithms can significantly reduce energy consumption compared with standard list scheduling.
References
[1] V. Tiwari, S. Malik, and M. Fujita, "Power analysis of embedded software: A first step towards software power minimization," in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, Nov. 1994, pp. 110–115.

[2] N. Chang, K. Kim, and H. G. Lee, "Cycle-accurate energy measurement and characterization with a case study of the ARM7TDMI," IEEE Trans. on VLSI Systems, vol. 10, no. 2, pp. 146–154, Apr. 2002.

[3] J. M. Chang and M. Pedram, "Register allocation and binding for low power," in Proceedings of the IEEE/ACM Design Automation Conference, June 1995, pp. 29–35.

[4] M. T.-C. Lee, V. Tiwari, S. Malik, and M. Fujita, "Power analysis and low-power scheduling techniques for embedded DSP software," in Proceedings of the IEEE International Symposium on System Synthesis, Sept. 1995, pp. 110–115.

[5] C. Lee, J.-K. Lee, and T. Hwang, "Compiler optimization on instruction scheduling for low power," in Proceedings of the IEEE International Symposium on System Synthesis, Sept. 2000, pp. 55–60.

[6] Hongbo Yang, Guang R. Gao, and Clement Leung, "On achieving balanced power consumption in software pipelined loops," in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 2002, pp. 210–217.

[7] A. Raghunathan and N. K. Jha, "An ILP formulation for low power based on minimizing switched capacitance during data path allocation," in Proceedings of the IEEE International Symposium on Circuits and Systems, May 1995, pp. 1069–1073.

[8] L. Kruse, E. Schmidt, G. Jochens, A. Stammermann, A. Schulz, E. Macii, and W. Nebel, "Estimation of lower and upper bounds on the power consumption from scheduled data flow graphs," IEEE Trans. on VLSI Systems, vol. 9, no. 1, pp. 3–14, Feb. 2001.

[9] A. Parikh, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin, "Instruction scheduling based on energy and performance constraints," in IEEE Computer Society Annual Workshop on VLSI, Apr. 2000, pp. 37–42.

[10] C.-L. Su, C.-Y. Tsui, and A. M. Despain, "Saving power in the control path of embedded processors," IEEE Design & Test of Computers, vol. 11, no. 4, pp. 24–30, Winter 1994.

[11] E. Musoll and J. Cortadella, "Scheduling and resource binding for low power," in Proceedings of the IEEE International Symposium on System Synthesis, Apr. 1995, pp. 104–109.

[12] M. Mehendale, S. D. Sherlekar, and G. Venkatesh, "Coefficient optimization for low power realization of FIR filters," in IEEE Workshop on VLSI Signal Processing, 1995, pp. 352–361.

[13] K. Masselos, S. Theoharis, P. K. Merakos, T. Stouraitis, and C. E. Goutis, "Low power synthesis of sum-of-products computation," in Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, July 2000, pp. 234–237.

[14] K. W. Choi and A. Chatterjee, "Efficient instruction-level optimization methodology for low-power embedded systems," in Proceedings of the IEEE International Symposium on System Synthesis, Oct. 2001, pp. 147–152.

[15] J. A. A. van der Veen, G. J. Woeginger, and S. Zhang, "Sequencing jobs that require common resources on a single machine: a solvable case of the TSP," Mathematical Programming, pp. 235–254, 1998.

[16] E. Macii, M. Pedram, and F. Somenzi, "High-level power modeling, estimation and optimization," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, pp. 1061–1079, November 1998.

[17] L. Benini, D. Bruni, M. Chinosi, C. Silvano, V. Zaccaria, and R. Zafalon, "A power modeling and estimation framework for VLIW-based embedded systems," in Proceedings of the International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS'01, 2001, pp. 26–28.

[18] A. Bona, M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, and R. Zafalon, "Energy estimation and optimization of embedded VLIW processors based on instruction clustering," in Proceedings of the IEEE/ACM Design Automation Conference, 2002, pp. 886–891.

[19] M. Sami, D. Sciuto, C. Silvano, and V. Zaccaria, "Instruction level power estimation for embedded VLIW cores," in Proceedings of the International Workshop on Hardware/Software Codesign, May 2000, pp. 34–38.

[20] M. Sami, D. Sciuto, C. Silvano, and V. Zaccaria, "Power exploration for embedded VLIW architecture," in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, Nov. 2000, pp. 498–503.

[21] Texas Instruments, Inc., TMS320C6000 Peripherals Reference Guide (Rev. D), 2001.

[22] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, 1979.

[23] H. A. B. Saip and C. L. Lucchesi, "Matching algorithms for bipartite graphs," Tech. Rep. DCC-93-03, Departamento de Ciência da Computação, Universidade Estadual de Campinas, March 1994, http://www.dee.unicamp.br/ic tr ftp/ALL/Abstrace.html.

[24] Texas Instruments, Inc., TMS320C6000 CPU and Instruction Set Reference Guide, 2000.
Robust Transmission of Wavelet Video Sequence Using Intra Prediction Coding Scheme
Joo-Kyong Lee, Ki-Dong Chung
Department of Computer Engineering, Pusan National University
[email protected], [email protected]
Abstract
A wavelet-based video stream is more susceptible to network transmission errors than DCT-based video. This is because bit errors in a subband of a video frame affect not only the other subbands within the current frame but also the subsequent frames. In this paper, we propose a video source coding scheme called Intra Prediction Coding (IPC) in which each subband refers to the 1-level lower subband of the same orientation within the same frame, in order to reduce error propagation to the subsequent frames. We evaluated the performance of the proposed scheme in a simulated wireless network environment. The tests show that IPC outperforms MRME on heavy-motion image sequences, while on low-motion image sequences IPC outperforms MRME at high bit rates.
Keywords
Intra Prediction Coding, Wavelet-based video, Error Propagation, Error Resilience, MRME
1. Introduction
As personal mobile devices become more and more popular, the network environment for transmitting multimedia data is moving from the current wired networks to wireless networks. There has been a lot of research in video coding to overcome the limited resources of wireless networks, and some of it has been published as part of DCT-based international standards such as H.263 [1] and MPEG-4 [2]. Besides, several wavelet-transform-based image codecs [3] and video codecs [4][5][6] have been studied, since they are efficient at low bit rates.
Most DCT-based standard video encoders perform motion-compensated prediction to reduce the temporal redundancy between the current frame and the reference frame. However, this scheme causes error propagation when a frame is corrupted during transmission over the network. To mitigate this problem, the H.263 and MPEG-4 standards include various error control schemes [7][8] as annexes. The error propagation problem in wavelet-based video is more serious than in DCT-based video. In Fig. 1, (a) shows an original image S of size M×N pixels and (b) shows a 3-level wavelet decomposition of the image S. In (b), S is decomposed into 10 subbands in total: in the first level of decomposition, one low pass subband and three orientation high pass subbands (LH1, HL1, HH1) are created. In the second level of decomposition, the low pass subband is further decomposed into one low pass and three
high pass subbands (LH2, HL2, HH2). This process is repeated on the low pass subband to form higher level wavelet decompositions. The inverse wavelet transform is calculated in the reverse manner to reconstruct the original image. However, if the higher level subbands of the transformed image are corrupted during transmission over the network, the effect of the errors propagates to the lower level subbands and to the subsequent frames of the motion-compensated video.
[Figure 1 residue: the original image S of horizontal size M and vertical size N, and its 3-level decomposition into the subbands LL3, LH1-LH3, HL1-HL3, HH1-HH3, with level-k subbands of size M/2^k × N/2^k.]

Fig. 1 (a) Original image before the wavelet transform and (b) the 3-level wavelet transformed image
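One decomposition level of the kind shown in Fig. 1 can be sketched with the Haar filter, chosen here for brevity; the paper does not specify which wavelet it uses, so treat this as an illustration of the subband layout rather than the paper's codec.

```python
def haar_1d(row):
    """One level of the 1-D Haar transform: pairwise averages (low pass)
    followed by pairwise differences (high pass), each scaled by 1/2."""
    low = [(row[2 * i] + row[2 * i + 1]) / 2 for i in range(len(row) // 2)]
    high = [(row[2 * i] - row[2 * i + 1]) / 2 for i in range(len(row) // 2)]
    return low + high

def haar_2d_level(img):
    """One 2-D decomposition level: transform rows, then columns,
    yielding LL, HL, LH and HH subbands in the four quadrants."""
    rows = [haar_1d(r) for r in img]
    cols = [haar_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

img = [[1, 1, 2, 2],
       [1, 1, 2, 2],
       [3, 3, 4, 4],
       [3, 3, 4, 4]]
out = haar_2d_level(img)
# The top-left 2x2 quadrant is the LL subband of local averages,
# [[1.0, 2.0], [3.0, 4.0]]; this piecewise-constant image has no detail,
# so the LH/HL/HH quadrants are all zero.
```

Repeating `haar_2d_level` on the LL quadrant produces the second and third decomposition levels of Fig. 1(b).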
So far, there has been some research on error control for wavelet-based images and video, but it is limited to the SPIHT (Set Partitioning In Hierarchical Trees) scheme [9][10]; error control for motion-compensated wavelet-based video has not been studied. In this paper, we propose a source coding scheme called Intra Prediction Coding to reduce the error propagation of motion-compensated wavelet-based video. In the proposed scheme, each subband estimates motion vectors from the 1-level lower subband of the same orientation within the same frame. However, the LLM 1) subband and the first level subbands (LH1, HL1, HH1) are exempt from this rule. That is to say, the LLM subband performs motion estimation from the LLM subband of the reference frame to exploit the high similarity between the two subbands, and the first level subbands do not perform motion estimation from any other subbands, considering coding efficiency and computational complexity. In the proposed scheme, if the LLM subband is protected from transmission errors, then even if the other subbands are corrupted, the subsequent frames are free from error propagation.
This paper is organized as follows. In the next section, we explain the MRME (Multi-Resolution Motion
Estimation) scheme, which performs motion compensated prediction from a previous frame. In Section 3,
we describe the detailed procedure of IPC. We then present the results of various simulations in an
error-prone environment. We conclude the paper with discussion in the last section.
1) LLM is the low pass subband of the M-level wavelet transformed image.
2. MRME (Multi-Resolution Motion Estimation)
MRME is a motion estimation scheme first proposed by Zhang and Zafar [4] as an alternative to the
conventional block-matching and subband-to-subband prediction schemes, aimed at reducing the
computational load [11]. MRME estimates motion vectors hierarchically from the higher level
subbands to the lower level subbands and uses the motion information of the highest level subband(s) to
predict that of the lower subbands, thereby reducing the computation load. Several variations are possible,
depending on the choice of the highest-level subbands used as a basis and on the refinement process;
Zhang and Zafar implemented four of them. In this paper, we use scheme III, in which MVs (Motion
Vectors) in the LLM subband are calculated and properly scaled as the initial biases for refining the MVs
in all other subbands, to describe the algorithm and to compare with our proposed scheme.
[Figure 2: a block in the LHM-1 subband is predicted from the LLM subband; MV_{M,5} is scaled to
2×MV_{M,5} and then refined by ∆ within a reduced search area to obtain MV_{M-1,5}. Notation:
2^{M-m}·p × 2^{M-m}·q is the block size of an m-th level subband, m is the wavelet transform level,
M is the highest wavelet transform level, and p × q is the block size of the LLM subband.]
Fig. 2 Motion estimation of the lower level subband using MVs of the LLM subband
In the MRME scheme, motion estimation consists of three major phases. First, every block of the LLM
subband estimates its motion vector from the LLM subband of the reference frame; the estimation method
is the same as in the conventional block-matching motion estimation scheme. Second, the MVs
calculated in the first phase are scaled in proportion to the size of the lower level subbands. Last,
the scaled MVs are refined through motion estimation within a reduced search area. Fig. 2 shows an
example of the MRME scheme. In the figure, to predict MV_{M-1,5}, MV_{M,5} in the LLM subband is expanded to
2×MV_{M,5}, and the incremental motion vector δ(x, y) is calculated within a reduced search area centered at
2×MV_{M,5}. Eq. (1) describes the calculation of all the MVs except those of the LLM subband, where i is the
wavelet transform level, j is the location of the subband, and the (x, y) coordinate is the block index in the
LLM subband.
MV_{i,j}(x, y) = 2^{M-i} · MV_{M,0}(x, y) + δ(x, y)        (1)
        for i = 0, 1, 2, ..., M and j = 1, 2, 3
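The scale-then-refine idea of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are ours, and the cost callback stands in for whatever block-matching criterion the refinement uses.

```python
def scaled_initial_mv(mv_LLM, i, M):
    """Scale an LL_M motion vector into the initial bias for level i.

    A level-i subband is 2**(M - i) times larger per dimension than LL_M,
    so the vector components are multiplied by that factor (Eq. 1)."""
    f = 2 ** (M - i)
    return (f * mv_LLM[0], f * mv_LLM[1])

def refine_mv(init_mv, cost, r=1):
    """Search the incremental vector delta within a reduced window of radius r
    around the scaled initial vector, returning the refined MV."""
    best, best_c = init_mv, cost(init_mv)
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            cand = (init_mv[0] + dx, init_mv[1] + dy)
            c = cost(cand)
            if c < best_c:
                best, best_c = cand, c
    return best
```

Because the refinement window radius r is much smaller than a full search range, the hierarchical scheme trades a small loss in optimality for a large reduction in the number of candidate positions evaluated.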
3. Intra Prediction Coding
Fig. 3 shows how IPC works on 3-level wavelet transformed video frames. The LL3 subband of
a video frame contains a large percentage of the total energy even though it is only 1/64 of the original
frame size. It is also similar to the LL3 of the previous frame, since this subband carries the coarse contour
of its frame. Therefore, for coding efficiency, we predict the MVs of LL3 of the current frame from LL3 of
the previous frame. All other subbands except the first level subbands perform motion compensated
prediction from the 1-level lower subbands in the same orientation within the same frame.
In wavelet-based video, there can be a large difference between the lower level subband coefficients of a
shifted image and those of the original image; this phenomenon occurs frequently around image
edges [14]. For this reason, the first level subbands are exempt from motion estimation from a
previous frame.
[Figure 3: frames F(t-1) and F(t); arrows show the reference direction. LL3 of F(t), the LL subband at the
highest transform level, refers to LL3 of F(t-1); LH3/HL3/HH3 of F(t) refer to LH2/HL2/HH2 within F(t);
LH2/HL2/HH2 refer to LH1/HL1/HH1; the first level subbands are used only as references for motion
estimated prediction.]
Fig. 3 Example of IPC on the 3-level transformed frames
3.1. Motion Estimation for the LLM subband
As mentioned previously, the LLM subband is more important than any other subband in a wavelet
transformed frame and contains very large coefficient values. The motion estimation process for this
subband is as follows. First, the subband is divided into blocks of the same size px×py, except at the
boundary. Next, for every block, Eq. (2) is applied to calculate the similarity between the current block
and a candidate block of the reference frame. In the equation, the cost function, the Sum of Absolute
Differences (SAD), is defined to measure the difference between the current block and the
candidate blocks. Let a pixel of the current block in the current frame be denoted as LL_M^t(x+k, y+l), where
(x, y) is the coordinate of the top left corner of the current block and (k, l) is the relative location from (x, y)
in the block. Let a pixel of the LL subband in the reference frame be LL_M^{t-1}(x+k+i, y+l+j), where
-p ≤ i, j ≤ p and p is the search range. More specifically, Eq. (2) sums up all the absolute differences
between the corresponding pixels of the current block and the candidate block located (i, j) away from the
block in the reference frame. From Eq. (2), we obtain a number of SAD values according to the size of the
search range p, and select the smallest value; its relative location from the current block becomes the MV
of the current block.
SAD(i, j) = Σ_{k=0}^{px-1} Σ_{l=0}^{py-1} | LL_M^t(x+k, y+l) - LL_M^{t-1}(x+k+i, y+l+j) |        (2)
MV(x, y) = arg min_{(i,j)} SAD(i, j),   -p ≤ i, j ≤ p
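A direct transcription of this full-search procedure might look like the following Python/NumPy sketch. The function name is ours, and we adopt the convention that x indexes columns and y indexes rows; candidates that fall outside the subband are simply skipped, one possible way of handling the boundary.

```python
import numpy as np

def full_search_mv(cur, ref, x, y, px, py, p):
    """Full-search block matching for an LL_M block (the Eq. 2 procedure).

    cur/ref: current and reference LL_M subbands (2-D arrays, row = y, col = x).
    (x, y): top-left corner of the px-by-py current block; p: search range.
    Returns the displacement (i, j) with minimal SAD."""
    block = cur[y:y + py, x:x + px].astype(np.int64)
    best, best_sad = (0, 0), None
    for j in range(-p, p + 1):
        for i in range(-p, p + 1):
            xs, ys = x + i, y + j
            if xs < 0 or ys < 0 or xs + px > ref.shape[1] or ys + py > ref.shape[0]:
                continue  # candidate block falls outside the subband
            sad = np.abs(block - ref[ys:ys + py, xs:xs + px].astype(np.int64)).sum()
            if best_sad is None or sad < best_sad:
                best, best_sad = (i, j), sad
    return best
```

The cost of this exhaustive scan over a (2p+1)² window is exactly what MRME's hierarchical refinement amortizes: the full search runs only on the small LL_M subband, and every other subband searches a reduced window.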
3.2. Motion Estimation for Other Subbands
After the motion estimation of the LLM subband, IPC calculates the MVs of the same level subbands,
LHM, HLM, HHM, and exploits these MVs to predict the MVs of the 1-level lower subbands. The motion
estimation procedure for the highest level subbands is shown in Eq. (3). There are three differences
between IPC and the conventional motion estimation scheme. First, all the subbands except LLM and the
first level subbands refer to the 1-level lower subbands in the same orientation within the same frame; for
instance, the subband s of level M refers to the subband s of level M-1. Second, because the coefficients
of the current subbands are larger than those of the reference subbands, a constant K is needed to
compensate for the difference. Third, the width and height of the reference subbands are double those of
the current subbands, so the coordinate of the top left corner of the reference subband is (2x, 2y) in
Eq. (3).
SAD_{M,s}(i, j) = Σ_{k=0}^{px-1} Σ_{l=0}^{py-1} | S_{M,s}(x+k, y+l) - K · S_{M-1,s}(2x+k+i, 2y+l+j) |        (3)
MV_{M,s}(x, y) = arg min_{(i,j)} SAD_{M,s}(i, j),   -p ≤ i, j ≤ p,  1 ≤ s ≤ 3
Except for the highest and the lowest level subbands, all subbands use the MVs of the 1-level higher
subbands in the same orientation, as in Eq. (4). For instance, the MVs of the level m subbands are
calculated by scaling the MVs of the level m+1 subbands, and then the incremental motion vector
∆(δx, δy) is calculated, in the same manner as in Eq. (1), within a reduced search area. Note that this
reduced search area is part of the 1-level lower subband. This process is repeated from the higher level
subbands down to the lower level subbands.
MV_{m,s}(x, y) = 2 · MV_{m+1,s}(x, y) + ∆(δx, δy)        (4)
        for 2 ≤ m ≤ M-1
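The cross-level search of Eq. (3), with its K-scaled reference and doubled coordinates, can be sketched in Python/NumPy as follows. The function names are ours, not the paper's; the same search driver, seeded with a scaled MV as in Eq. (4), would serve the refinement step for intermediate levels.

```python
import numpy as np

def intra_sad(S_m, S_m1, x, y, px, py, i, j, K=3):
    """SAD between a level-m block and a K-scaled candidate in the 1-level
    lower subband S_m1 (Eq. 3). Reference coordinates start at (2x, 2y)
    because the lower-level subband is twice as large per dimension."""
    cur = S_m[y:y + py, x:x + px].astype(np.float64)
    ref = K * S_m1[2 * y + j:2 * y + j + py, 2 * x + i:2 * x + i + px].astype(np.float64)
    return float(np.abs(cur - ref).sum())

def intra_mv(S_m, S_m1, x, y, px, py, p, K=3):
    """Arg-min over the (2p+1)^2 candidate displacements, skipping candidates
    that fall outside the reference subband."""
    best, best_sad = None, None
    for j in range(-p, p + 1):
        for i in range(-p, p + 1):
            ys, xs = 2 * y + j, 2 * x + i
            if ys < 0 or xs < 0 or ys + py > S_m1.shape[0] or xs + px > S_m1.shape[1]:
                continue
            sad = intra_sad(S_m, S_m1, x, y, px, py, i, j, K)
            if best_sad is None or sad < best_sad:
                best, best_sad = (i, j), sad
    return best
```

Note that, unlike Eq. (2), both the current block and the candidate come from the same frame, which is what confines any transmission error to a single frame.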
3.3. Motion Compensation
IPC performs motion compensation for the LLM subband in the same way as the conventional
block-matching scheme and uses Eq. (5) for the other subbands. In the equation, R_{m,s} denotes the
motion compensated residual signal of the subband s at level m: the difference between the blocks of
subband s at level m and the best matching blocks in the 1-level lower subband.
R_{m,s} = S_{m,s} - K · S_{m-1,s}(MV_{m,s})        (5)
        for 2 ≤ m ≤ M,  1 ≤ s ≤ 3
Fig. 4 Encoding algorithm for Intra Prediction Coding scheme
4. Implementation and Simulation Results
4.1. Implementation
To evaluate the performance of our proposed scheme, we modified Geoff Davis's wavelet image
coder [12] into a motion compensated wavelet video coder. In the implemented video coder, each frame
except the first is encoded in either MRME or IPC mode according to its similarity to the previous frame.
In other words, the first frame is encoded as an I frame and the subsequent frames are encoded as motion
compensated frames, using the MAD (Mean Absolute Difference), a measure of the difference between
Intra_Inter_ME( )
/* Input parameters:  c_frame : the wavelet transformed current frame
                      prev_frame : the reference frame
                      max_level : the highest DWT level
                      block_numbers : the number of blocks in a subband
   Output parameters: mc_frame : motion compensated frame
                      MV : the set of calculated MVs */
M  = max_level ;      /* The highest DWT level */
BN = block_numbers ;  /* The number of blocks in a subband */
For (i = 0; i < BN; i++)                  /* ME/MC for the LLM subband */
    MV[M][0][i] = Do_LL_MEMC(i, c_frame, prev_frame) ;    /* use of Eq. (2) */
For (j = 1; j <= 3; j++)                  /* for LHM, HLM, HHM subbands */
    For (i = 0; i < BN; i++)              /* perform motion estimation */
        MV[M][j][i] = Intra_pred(M, j, i, c_frame) ;      /* use of Eq. (3) */
For (k = M-1; k > 1; k--)                 /* for lower level (2 ~ M-1) subbands */
    For (j = 1; j <= 3; j++)
        For (i = 0; i < BN; i++)
            MV[k][j][i] = Intra_pred(MV, k, j, i, c_frame) ; /* MVs via Eq. (4) */
mc_frame = Do_MC(MV, c_frame, prev_frame) ; /* Motion compensation for all
                                               subbands except LLM : Eq. (5) */
the two objects. In Fig. 5, the compression mode of the current frame is determined after motion
compensation of the LLM subband: if the MAD of the two LLM subbands is greater than a given
threshold T, the encoder operates in the IPC mode; otherwise it operates in the MRME mode.
Because MAD is the average difference between the coefficients of the neighboring LLM
subbands, MAD becomes higher as the motion in a video sequence changes more rapidly. Therefore,
considering coding efficiency and the error propagation rate, it is reasonable to use the IPC scheme for a
heavy motion video sequence. Eq. (6) defines the MAD between the neighboring LLM subbands,
where H and V are the horizontal and vertical sizes of the LLM subband and LL_M^t(i, j) is a coefficient
value at coordinate (i, j) in the LLM subband of frame f(t).
MAD(LL_M^t) = (1 / (H·V)) Σ_{i=0}^{H-1} Σ_{j=0}^{V-1} | LL_M^t(i, j) - LL_M^{t-1}(i, j) |        (6)
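The MAD test and the resulting mode decision can be sketched as follows (Python/NumPy; the function names are ours). We follow the paper's rationale that a large MAD, indicating heavy motion, favours IPC, since IPC confines references to the current frame, while low motion favours MRME's more efficient inter-frame prediction.

```python
import numpy as np

def mad(LL_cur, LL_prev):
    """Mean absolute difference between neighbouring LL_M subbands (Eq. 6)."""
    return float(np.abs(LL_cur.astype(np.float64) - LL_prev.astype(np.float64)).mean())

def choose_mode(LL_cur, LL_prev, T):
    """Per-frame mode decision: MAD above threshold T (heavy motion) selects
    IPC; low MAD (similar neighbouring frames) selects MRME."""
    return "IPC" if mad(LL_cur, LL_prev) > T else "MRME"
```

Lowering T therefore raises the fraction of IPC-coded frames, which matches the relationship between T and the IPC ratio reported in the experiments below.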
After motion compensation, the implemented video coder performs quantization and VLC (Variable
Length Coding) as employed in Geoff Davis's wavelet image coder, as shown in Fig. 5.
[Figure 5: block diagram of the implemented video coder. The input video is wavelet transformed (DWT);
a coding control compares the LL subband MAD against the threshold to select either LLME followed by
IPC, or MRME; motion compensated residuals are quantized (Q) and variable length coded (VLC) into the
bit stream, with inverse quantization (Q^-1) and a frame memory in the reconstruction loop; MVs are
multiplexed into the bit stream. Legend: DWT = Discrete Wavelet Transform, LLME = Motion Estimation
for the LL subband, MC = Motion Compensation, IPC = Intra Prediction Coding, MRME = Multi-Resolution
Motion Estimation, Q = Quantization, VLC = Variable Length Coding, MVs = Motion Vectors.]
Fig. 5 Block diagram for the implemented video coder
4.2. Simulation Results
The proposed scheme was tested on video sequences with different motion rates. Table 1 lists the video
sequence and wavelet transform parameters used in the experiments. As previously stated, there is a large
difference in coefficient values between the higher and lower level subbands, which must be compensated
by the constant K in Eq. (3). To select a proper value, we compared the MADs of various video sequences
while varying the constant, and concluded that the integer 3 is the best value, even though there are small
differences among the sequences. The tested video sequences are standard QCIF (176×144) sequences,
whose original frame rate is 30 frames/s; the frame rate was reduced to 10 frames/s to demonstrate
performance at low bit-rates. We assume that the first four frames are protected from errors, to allow the
decoder to stabilize into a steady state before errors are introduced.
<Table 1> The parameters for simulation

  Parameters                       Description
  Format of tested video           QCIF (176×144)
  Video sequences  File name       Claire.qcif    Foreman.qcif
                   Motion rate     Low            High
  Frame rate       Original video  20 fps
                   Coded video     10 fps
  Maximum DWT level                3
  Filters for wavelet transform    Antonini 7/9
  Complement constant K            3
In the simulations, we used the average luminance PSNR (Peak Signal-to-Noise Ratio) as the video
quality measure. PSNR is defined as 10·log10(255²/σe²), where σe² is the mean-square error between the
original image and the processed image. For experimental convenience, we assume that a single block
row in a subband forms a synchronization group, so that corruption of one block corrupts the other blocks
in the same group, similar to a GOB (Group of Blocks) in the H.263 standard. To hide corruption, the
decoder employs a simple error concealment technique, previous frame concealment, in which the
corrupted blocks are replaced by the corresponding pixels of the reference frame.
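The quality measure and the concealment step can both be sketched in Python/NumPy (the function names are ours; corrupted regions are indexed in units of whole block rows, matching the synchronization-group assumption above):

```python
import numpy as np

def psnr(orig, proc):
    """Luminance PSNR in dB: 10 * log10(255^2 / MSE)."""
    mse = np.mean((orig.astype(np.float64) - proc.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def conceal_rows(frame, ref_frame, corrupted_rows, block_h):
    """Previous-frame concealment: replace every corrupted block row (one
    synchronization group) with the co-located pixels of the reference frame."""
    out = frame.copy()
    for r in corrupted_rows:
        out[r * block_h:(r + 1) * block_h, :] = ref_frame[r * block_h:(r + 1) * block_h, :]
    return out
```

Such concealment works well when neighbouring frames are similar (low motion) but degrades under heavy motion, a behaviour the experiments below exhibit.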
[Figure 6: two states G (Good) and B (Bad); transition G→B with probability p, B→G with probability q;
self-transitions with probabilities 1-p and 1-q.]
Fig. 6 A two state Markov channel
In the experiments, Gilbert's error model [13] and WCDMA network errors are employed to investigate
the performance of the proposed algorithm in a wireless network environment. Gilbert's error model
describes the error characteristics of a wireless channel with two states, as shown in Fig. 6. It is a
two-state Markov chain with states named Good and Bad. Each state is assigned a specific
constant bit error rate (BER): eG in the good state and eB in the bad state (eG ≪ eB). The state transition
probability p is the probability that the next state is the bad state, given that the current state is good,
and q is the probability of the reverse transition. In the simulation, we fixed eG to 0.0001972644427729, eB
to 0.1 and q to 0.00078125, while we varied p from 10^-4 to 10^-1 to simulate various network states.
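A minimal simulation of this channel, assuming the usual Gilbert-Elliott semantics (a per-bit error draw in the current state followed by a state transition), might look like the following sketch; the function name and parameter order are ours.

```python
import random

def gilbert_errors(n_bits, p, q, eG, eB, seed=0):
    """Two-state Gilbert-Elliott channel: in state Good bits flip with
    probability eG, in state Bad with eB; Good->Bad with probability p,
    Bad->Good with q. Starts in Good; returns the list of error positions."""
    rng = random.Random(seed)
    state_bad = False
    errs = []
    for i in range(n_bits):
        if rng.random() < (eB if state_bad else eG):
            errs.append(i)  # this bit is corrupted
        if state_bad:
            state_bad = rng.random() >= q  # stay Bad unless recovery fires
        else:
            state_bad = rng.random() < p   # fall into the burst state
    return errs
```

With the parameter values quoted above (tiny eG, large eB, small q), errors arrive in bursts whose frequency grows with p, which is exactly the knob varied in the experiments.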
First, we evaluated the performance of the two schemes, IPC and MRME, at bit-error rates ranging
from 10^-4 to 10^-1, as shown in Fig. 7 and Fig. 8. In the figures, T is the threshold that determines the
motion estimation mode, IPC or MRME; as T decreases, the IPC ratio in a video sequence increases.
Fig. 7 shows the average PSNR performance on the Foreman video sequence at IPC ratios of 100%, 90%,
55%, 5% and 0%, obtained with T = 2, 5, 7, 10, and 15, respectively. As the IPC ratio approaches 0%, the
MRME ratio rises to 100%. One might expect the output bit-rates to differ with the IPC ratio; however,
the experiments showed that they are almost the same on a heavy motion video sequence such as Foreman.
In Fig. 7, the compressed bit-rate is about 424 Kbits/sec. Fig. 7 shows that as the proportion of IPC
increases, the PSNR performance becomes higher regardless of the BER. This is because IPC suppresses
error propagation to the subsequent frames by referring to the same frame.
[Figure 7: PSNR (dB) versus BER for T = 2, 5, 7, 10, 15; PSNR ranges from about 24.5 to 29 dB as the
BER varies from 10^-4 to 10^-1.]
Fig. 7 Comparison of the IPC and MRME on Foreman sequence at different bit-error rates
In Fig. 8, we tested the Claire video sequence, a low motion video. In this experiment, the
thresholds are smaller than those for Foreman, because the neighboring frames are very similar to
each other. The Claire video sequence is encoded at IPC ratios of 100%, 75%, 30%, 10% and 0% with
T = 0.5, 1, 1.5, 2, and 10, respectively. Contrary to the result for Foreman, the output bit-rate increases in
proportion to the IPC ratio on the Claire video sequence: the bit-rate for the proposed
scheme is 599 Kbits/sec, while that for pure MRME is 573 Kbits/sec, so the proposed scheme requires
about 4% more bit-rate. This is due to the high coding efficiency of MRME, which performs motion
estimation from a very similar previous frame. In the figure, as the proportion of IPC increases, the PSNR
performance becomes higher, similarly to the result on Foreman, and this trend becomes more prominent
as the bit-error rate increases. The reason is that even a simple error concealment scheme is effective at a
low bit-error rate on a low motion video such as Claire, but as the bit-error rate increases, the error
concealment has little effect.
[Figure 8: PSNR (dB) versus BER for T = 0.5, 1, 1.5, 2, 10; PSNR ranges from about 36 to 44 dB as the
BER varies from 10^-4 to 10^-1.]
Fig. 8 Comparison of the IPC and MRME on Claire sequence at different bit-error rates
[Figure 9: PSNR (dB) versus frame number (0-60) for T = 2, 5, 7, 10, 15; PSNR ranges from about 22 to
36 dB.]
Fig. 9 Comparison of PSNR on the Foreman video sequence from frame 1 to frame 60
[Figure 10: PSNR (dB) versus frame number (0-60) for T = 0.5, 1, 1.5, 2, 10; PSNR ranges from about 34
to 50 dB.]
Fig. 10 Comparison of PSNR on the Claire video sequence from frame 1 to frame 60
Fig. 9 shows the per-frame PSNR performance at a BER of 10^-3 on the Foreman video sequence. From
the 2nd to the 12th frame, pure MRME shows higher PSNR than the other configurations, because there is
hardly any motion change in the initial period of Foreman. From the 13th frame, however, the situation
reverses: the curves for T=2 and T=5 show high PSNR, while the curve for pure MRME declines slowly
over time. This means that accumulated distortion sharply decreases the performance of MRME as time
passes, since it always refers to a previous frame. In contrast, the IPC curves for T=2 and T=5 rise and
fall independently of the passage of time, owing to the characteristic of IPC that it refers to the current
frame rather than the previous frame.
Fig. 10 shows the results of the experiment on the Claire video sequence under the same conditions as
Fig. 9. The results are very similar to those of Fig. 9 on the whole, except that the curves for T=0.5 and
T=1 differ from each other.
The reconstructed 51st frames of Foreman at various threshold values are shown in Fig. 11. As the IPC
ratio goes up, the frames look clearer and smoother; conversely, at a high MRME ratio, the frames look
rough and noisy. This result illustrates the effect of errors on the lower level subbands. In the case of
MRME, when a bit error occurs in a subband, the error propagates not only to the lower level subbands
within the same frame but also to the same and lower level subbands of the subsequent frames, which
makes the reconstructed image look rough and noisy. In the case of IPC, however, since the lower level
subbands except the LLM subband are not influenced by the errors of the previous frame, the
reconstructed image looks clear and smooth.
Fig. 11 Reconstructed real image at different threshold values on Foreman sequence
Fig. 12 Reconstructed real image at different threshold values on Claire sequence
Fig. 12 shows the reconstructed 51st frames of Claire under the same conditions as Fig. 10. In the figure,
the visual quality of the IPC coded image also looks better than that of MRME, especially around the
woman's face. This is because the probability of error propagation is lower for IPC than for MRME, and
because MRME cannot properly recover errors on the moving face with a simple error concealment
scheme.
In the next experiment, we investigate the performance of the reconstructed images under a simulated
CDMA network environment. The most important simulation parameters used to produce the error files
are listed below.
Table 2 The simulation parameters for the CDMA network

  Parameters          Values
  Bit-rate            200 Kbps ~ 1.7 Mbps
  Carrier frequency   1.9 GHz
  Doppler frequency   5.3, 70, 211 Hz
  BER                 10^-4, 10^-3
  Channel type        Vehicular A
Fig. 13 compares the average PSNR of Foreman at different threshold values when the Doppler frequency
is 70 Hz and the BER is fixed to 10^-4 and 10^-3, respectively. The simulation results show that IPC
achieves higher PSNR than MRME regardless of the BER. Moreover, at low bit-rates, the PSNR for T=5
and T=7, where the coded video sequence mixes the IPC and MRME schemes, outperforms that of the
pure MRME scheme. The simulation results for Doppler frequencies of 5.3 and 211 Hz are very similar.
[Figure 13: two panels of PSNR (dB) versus bit-rate (0-1800 Kbps) for T = 2, 5, 7, 10, 15; left panel
BER = 10^-4 (PSNR about 22-44 dB), right panel BER = 10^-3 (PSNR about 23-37 dB).]
Fig. 13 The comparison of PSNR of Foreman image sequence in the CDMA network environment
For the Claire video sequence, Fig. 14 shows more complex PSNR results: the PSNR of IPC decreases as
the bit-rate becomes lower, while that of MRME changes little. It can be inferred that the quantization
scheme used in Geoff Davis's image coder affects the performance of the proposed scheme. That is, the
lower level subbands serve as references for the higher level subbands in the proposed scheme, but the
quantization scheme quantizes the lower level subbands with a larger step size. Therefore, as the
quantization level goes up, the lower level subbands are quantized more coarsely, which sharply
decreases the PSNR of IPC, particularly on a low motion video such as Claire.
[Figure 14: two panels of PSNR (dB) versus bit-rate (250-700 Kbps) for T = 0.5, 1, 1.5, 2, 10; left panel
BER = 10^-4, right panel BER = 10^-3; PSNR about 36-46 dB.]
Fig. 14 The comparison of PSNR of Claire video sequence in the CDMA network environment
5. Conclusion
In this paper, we proposed a source coding scheme called Intra Prediction Coding for wavelet
based video to control error propagation. In the proposed scheme, only the LLM subband of the current
frame estimates MVs from the reference frame; the other subbands search for MVs in the 1-level lower
subbands within the same frame. Therefore, if the LLM subband is preserved from transmission errors,
the errors are prevented from propagating to the subsequent frames. We implemented a wavelet video
coder by modifying Geoff Davis's image coder and tested the performance of the proposed scheme using
Gilbert's model and a CDMA network model. Extensive computer simulations demonstrated that the
proposed scheme performs better than MRME on heavy motion frame sequences, while IPC also
outperforms MRME at high bit-rates on low motion frame sequences. It can be inferred that the
quantization scheme used affects the proposed scheme, so we will design a new quantization scheme for
the proposed video coder to improve its coding efficiency in the near future.
6. References
[1] "Video coding for low bitrate communication," ITU-T Recommendation H.263, International
Telecommunications Union, Geneva, Switzerland, 1998.
[2] R. Koenen, "Overview of the MPEG-4 standard," ISO/IEC JTC1/SC29/WG11 N4668,
http://mpeg.telecomitalialab.com/standards/mpeg-4/mpeg-4.htm
[3] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 still image coding system: An
overview," IEEE Trans. Consumer Electronics, Vol. 46, No. 4, pp. 1103-1127, Nov. 2000.
[4] Y. Zhang and S. Zafar, "Motion-compensated wavelet transform coding for color video compression,"
IEEE Trans. Circuits Syst. Video Technol., Vol. 2, pp. 285-296, Sept. 1992.
[5] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans.
Signal Processing, Vol. 41, No. 12, pp. 3445-3462, Dec. 1993.
[6] A. Said and W. A. Pearlman, "A new, fast, and efficient image codec based on set partitioning
in hierarchical trees," IEEE Trans. Circuits Syst. Video Technol., Vol. 6, No. 3, pp. 243-250, June 1996.
[7] S. Wenger, G. Knorr, J. Ott, and F. Kossentini, "Error resilience support in H.263+," IEEE Trans.
Circuits Syst. Video Technol., Vol. 8, No. 6, pp. 867-877, Nov. 1998.
[8] Y. Takashima, M. Wada, and H. Murakami, "Reversible variable length codes," IEEE Trans.
Communications, Vol. 43, pp. 158-162, Feb./Mar./Apr. 1995.
[9] A. A. Alatan and M. Zhao, "Unequal error protection of SPIHT encoded image bitstreams," IEEE
JSAC, Vol. 18, No. 6, pp. 814-818, June 2000.
[10] S. Cho and W. A. Pearlman, "Error resilient video coding with improved 3-D SPIHT and error
concealment," SPIE/IS&T Electronic Imaging 2003, Proceedings SPIE Vol. 5022, Jan. 2003.
[11] J. Zan, M. O. Ahmad, and M. N. S. Swamy, "New techniques for multi-resolution motion estimation,"
IEEE Trans. Circuits Syst. Video Technol., Vol. 12, No. 9, Sept. 2002.
[12] http://www.geoffdavis.net/
[13] J. Ebert and A. Willig, "A Gilbert-Elliot bit error model and the efficient use in packet level
simulation," Technical Report TKN-99-002, Technical University of Berlin, March 1999.
[14] H. S. Kim and H. W. Park, "Wavelet-based moving-picture coding using shift-invariant motion
estimation in wavelet domain," Signal Processing: Image Communication, Vol. 16, pp. 669-679,
April 2001.
Session 4: Language and Computation Model
Problems with using MPI 1.1 and 2.0 as compilation targets for parallel language implementations
Dan Bonachea and Jason Duell
Computer Science Division
University of California, Berkeley
Berkeley, California, USA
E-mail: [email protected], [email protected]
Keywords: MPI one-sided, parallel languages, global address space, RMA, one-sided communication.
Abstract
MPI support is nearly ubiquitous on high performance systems today, and is generally highly tuned for performance. It would thus seem to offer a convenient "portable network assembly language" to developers of parallel programming languages who wish to target different network architectures. Unfortunately, neither the traditional MPI 1.1 API nor the newer MPI 2.0 extensions for one-sided communication provide an adequate compilation target for global address space languages, and this is likely to be the case for many other parallel languages as well. Simulating one-sided communication under the MPI 1.1 API is too expensive, while the MPI 2.0 one-sided API imposes a number of significant restrictions on memory access patterns that would need to be incorporated at the language level, as a compiler cannot effectively hide them given current conflict and alias detection algorithms.
1 Introduction
Global address-space (GAS) languages, such as Unified Parallel C (UPC) [10], Titanium [31], and Co-Array Fortran [27], are an emerging class of languages that seek to give parallel application developers an alternative to the traditional message passing model. By providing the illusion of a shared address space to a distributed program (regardless of the underlying hardware), along with programmatic awareness and control of memory layout across processors, they promise to combine the performance of message passing systems with the convenience of shared memory programming.
As researchers involved in developing portable implementations of two of these languages (Berkeley UPC [7] and Titanium [31]), we have invested (and continue to invest) substantial amounts of effort into porting our software to new network architectures and programming interfaces.
Many other scientists and hardware vendors, upon learning this, have asked us why we do not simply write our language on top of MPI, to avoid this porting effort. As the most common parallel programming interface on contemporary supercomputing platforms, MPI has typically been highly tuned for performance. It has also been described by some of its developers as providing an "assembly language for parallel processing" [13]. Presumably, then, it ought to provide an efficient, portable target for parallel language compilers.
Unfortunately for GAS language implementors, this is not the case. This paper explains why both the traditional MPI 1.1 two-sided interface [25] and the newer MPI 2.0 [24] one-sided interface (hereafter referred to as MPI-RMA) are inadequate compilation targets for global address space languages. The two-sided communication model inherent in MPI 1.1 cannot support the one-sided communication requirements of GAS languages as efficiently as lower-level APIs. The newer one-sided MPI-RMA presents an interface that, while undoubtedly of great use to many application developers, imposes a number of restrictions on memory access patterns that would require conflict and alias analysis beyond the reach of current compilers. The difficulties presented by MPI-RMA as a compilation target are not specific to GAS languages, and are likely to present an obstacle to any parallel programming language which does not expose application developers to MPI-RMA's numerous restrictions on memory usage patterns.
1.1 Global Address Space languages and their communication requirements
The common characteristic of all GAS languages is that they provide programmers with a shared memory abstraction between all processors in an application (regardless of
whether shared memory is actually supported natively in hardware), and that the programmer is given both knowledge and control of the layout of shared memory across processors. The shared memory model of GAS languages provides the programmer with a familiar interface in which any piece of shared memory can be accessed by any processor in an application via the normal data access mechanisms built into the language, i.e., by accessing an element of a shared array, dereferencing a pointer to shared data, or simply referencing the name of a shared variable. Of course, on many platforms the access time for touching shared memory will depend heavily on the location of the data, and so GAS languages make the layout of shared memory explicit to the programmer, who is encouraged to write code that takes advantage of locality in order to achieve higher performance.
This combination of a globally shared address space with locality information allows an incremental development model, in which serial or shared-memory codes can initially be naïvely ported to a distributed environment, profiled to isolate bottlenecks, and then gradually tuned to optimize performance. The compiler for a GAS language may also be able to hide some or all of the latency cost of remote accesses by scheduling unrelated computation (or more communication) during the interim imposed by network traffic.
GAS languages have some differences in how they allow shared memory to be allocated, and/or how they distinguish shared and non-shared data. UPC and Co-Array Fortran require the programmer to explicitly declare data objects to be either shared or private (usage varies by application, but typically the major data structures reside in shared space). In Titanium, by contrast, all heap and static data can potentially be accessed remotely [14], although a sophisticated compiler escape analysis [19] can detect which objects are potentially "shared" (interesting applications typically have about 50-100% of the total bytes allocated judged to be shared by the analysis). Dynamic allocation of shared memory is required to be collective in Co-Array Fortran, while it can also be achieved in UPC and Titanium through purely local, non-collective operations. UPC additionally allows shared objects to be allocated on remote processes using a non-collective operation (upc_global_alloc()) with no explicit cooperation from the process allocating the data.
Remote data access in GAS languages naturally leads to a one-sided communication pattern. In part, this is a matter of providing a familiar and convenient interface to programmers who are used to a shared-memory programming model, but there is also an important class of dynamic and irregular applications that we wish to support which have communication patterns that are data-dependent and therefore statically unpredictable, and hence are most naturally expressed using one-sided operations.
In GAS languages, the ability to predict data access patterns is further limited by the fact that all three of these GAS languages allow accesses to shared objects with affinity to the calling thread via “local” pointers, and these accesses are indistinguishable from accesses to purely local (private) objects, i.e., the languages’ semantics provide no explicit information about whether the memory being accessed is potentially shared or not. In each language, accesses through “local” pointers to shared data which reside locally generally provide significantly better performance than access through “global” pointers. In fact, the performance impact is so dramatic that most UPC programmers specifically optimize for this case, and the Titanium compiler includes a specialized analysis [18] to automatically infer when such a transformation is provably legal.
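The local-pointer optimization described above is commonly written in UPC as an explicit cast; the following sketch shows the idiom (the array name and function are illustrative only):

```c
#include <upc.h>

shared int data[100*THREADS];   /* default blocksize: cyclic distribution */

int local_sum(void) {
    /* Cast away the shared qualifier for the elements with affinity to
       this thread. The resulting ordinary C pointer bypasses the
       global-pointer representation entirely, so the compiler cannot
       distinguish these loads from accesses to purely private memory. */
    int *lp = (int *)&data[MYTHREAD];  /* legal only because affinity is local */
    int sum = 0;
    for (int i = 0; i < 100; i++)
        sum += lp[i];   /* lp[i] is the local copy of data[MYTHREAD + i*THREADS] */
    return sum;
}
```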
A result of these data-dependent access patterns and shared/local aliasing is that compilers for GAS languages generally have no way to know a priori which specific shared memory locations will be accessed remotely, or when they will be accessed. They thus have no way to statically check, for instance, whether data accesses from different processors will conflict at runtime. This is exactly analogous to the problem of static alias analysis in sequential, pointer-based languages, but is complicated by the existence of multiple independent threads of control – for in-depth discussion, see [16, 28, 23].
Given these properties of GAS languages, they present the following requirements for any underlying network interface:
• The ability to perform (or at least simulate) one-sided communication (i.e., where only the initiator is explicitly involved and the communication proceeds independently of any action taken by the remote threads). While one-sided communication can be simulated by the language runtime using an underlying two-sided communication API, neither the user nor the compiler may be exposed to any aspect of implementing message receipts (or other bookkeeping) on the remote side of an access.
• Good latency performance for small remote accesses. Small messages are commonly used in initial implementations of many GAS programs, and may be unavoidable even in well-tuned applications in certain problem domains. The performance of applications written in global-address-space languages is thus often very sensitive to network round-trip latency.
• The ability for the compiler to hide network latencies by overlapping communication with computation and/or other communication, through the use of non-blocking remote accesses. This requires that the software overhead involved in network traffic be less than the latency of messages (and/or the gap between how often the network interface will accept overlapping messages), as overlapping is impossible if the host CPU is busy throughout a messaging operation.
• Support for arbitrary access patterns to shared data. This includes allowing concurrent remote accesses to the same regions of shared memory from different remote processors, as well as support for allowing shared data to be accessed at arbitrary points in the program via local pointers.
• Support for using or implementing collective communication and synchronization operations.
The rest of this paper shows that neither the MPI 1.1 interface nor the newer MPI-RMA interface supports all of these characteristics adequately.
2 MPI 1.1 as a compilation target for GAS languages
The most widely available and portable software interface for programming distributed-memory machines today is that provided by the MPI 1.1 specification, which has been implemented and carefully tuned on most contemporary high-performance parallel systems. As such, it presents a very tempting compilation target for GAS languages, which could potentially run on all such systems by simply providing a single networking layer that runs on top of MPI.
Communication under MPI 1.1 is strictly two-sided: all traffic takes the form of message sends, which require matching receive operations to be explicitly issued on the receiving side. While this does not neatly match the GAS model of one-sided communication, it does not disqualify MPI 1.1 from use by GAS languages. Our group has implemented an MPI layer [1] which transparently handles message receipt in the runtime by periodically polling for new message receipts, thus providing the illusion of a one-sided interface to both the user and the compiler. We have used this MPI layer on a number of platforms, including the IBM SP, the Cray T3E, the SGI Origin 2000, Linux and Compaq AlphaServer systems using Quadrics network hardware, and Linux clusters using Myrinet, Dolphin/SCI, and Ethernet networks. This portable MPI layer has proven to be an invaluable prototyping tool for quickly deploying our systems on new architectures.
However, as Figure 1 shows, use of MPI 1.1 implementations does not come without a cost. This microbenchmark data, gathered by our group (and described in detail in [4]), demonstrates that the latency and/or software overhead associated with small message sends in MPI is typically much higher than when using native network APIs. This difference is most pronounced on the lowest-latency
systems, which are the most likely targets for GAS applications, with more than a factor of five difference between MPI performance and that of some native machine APIs.
It should be noted that the MPI data in Figure 1 were gathered using nonblocking send and receive calls, and that these calls tend to incur higher overhead than their blocking counterparts in most MPI implementations [4]. It is unclear whether these nonblocking overheads are unavoidable, or are simply the result of less tuning by vendors: most MPI applications use blocking functions, and vendor benchmarks generally report only numbers for blocking send/recv calls. Use of at least some nonblocking calls is necessary when simulating one-sided communication with a two-sided API, in order to prevent deadlock (which could otherwise be caused trivially by two processors sending a message to each other at the same time). Nonblocking calls are also desirable in a GAS language context, as compilers may be able to overlap computation (and/or other network calls) within the interval where the CPU would otherwise be blocked.
To more easily support porting our UPC and Titanium implementations, we use a portable, high-performance networking library called GASNet [5]. This is the layer which provides the one-sided messaging abstraction over MPI 1.1. Besides MPI 1.1, GASNet has also been implemented directly over the native APIs of a number of high-performance networks (the IBM SP’s LAPI [17], Myrinet’s GM [12], Quadrics’ elan [11], and Infiniband [15]). UPC and Titanium applications can switch the underlying network used with a simple recompilation, and this gives us the opportunity to directly compare the performance of a single application run using MPI 1.1 for communication with its timing when run over a lower-level network API. Figures 2, 3, and 5 show the difference in performance between a set of UPC applications running on a Compaq AlphaServer system with a Quadrics interconnect: a naïve and a bulk-synchronous version of the NAS Parallel Benchmarks [2] Conjugate Gradient, and a bulk-synchronous implementation of the NAS Multigrid benchmark. Figures 4 and 6 show the same bulk CG and MG codes running on an x86-Linux Myrinet cluster. Finally, Figure 7 shows the performance of the MG benchmark on an IBM SP Power3. The applications were compiled with the Berkeley UPC Compiler v1.0beta [7] to use either MPI 1.1 or the lower-level vendor-specific networking layer for communication. For comparison purposes, on the Compaq system we also show the performance for the same applications when compiled with the 2.1-003 version of the Compaq UPC compiler [8], which compiles executables to use the vendor-specific elan API (data for Compaq UPC is omitted for higher numbers of nodes due to a recently-discovered performance bug which is still pending). In all cases, the network and system hardware being used is identical – the only difference is the runtime system and communication software in use.

Figure 1. Send and receive software overheads (os and or) superimposed on the one-way end-to-end latency (EEL) for 8-byte messages on various high-performance systems. For MPI on the T3E and Myrinet, the sum of the overheads is greater than EEL, and so os = S + V and or = R + V. For the other configurations, os = S and or = R.

Figure 2. Performance for a naïve (fine-grained) UPC implementation of NAS Conjugate Gradient using different network APIs/compilers on a Compaq AlphaServer.

Figure 3. Performance for a bulk-synchronous UPC implementation of NAS Conjugate Gradient using different network APIs/compilers on a Compaq AlphaServer.

Figure 4. Performance for a bulk-synchronous UPC implementation of NAS Conjugate Gradient using different network APIs with the Berkeley UPC compiler over Myrinet/GM.

Figure 5. Performance for a bulk-synchronous UPC implementation of NAS Multigrid using different network APIs/compilers on a Compaq AlphaServer.

Figure 6. Performance for a bulk-synchronous UPC implementation of NAS Multigrid using different network APIs with the Berkeley UPC compiler over Myrinet/GM.

Figure 7. Performance for a bulk-synchronous UPC implementation of NAS Multigrid using different network APIs with the Berkeley UPC compiler on an IBM SP.
The results show that the MPI 1.1-based GASNet layer is significantly outperformed by those based directly over one-sided networking APIs. The communication costs of the bulk-synchronous CG code are dominated by bulk put operations (avg. 56KB each) and some small (8-byte) gets, whereas the MG code is dominated by bulk get operations (avg. 100KB each) and some small (8-byte) gets. For the two bulk-synchronous benchmarks, the performance gap is less a function of any inherent difference between the MPI bandwidth and that of the lower-level API (the two tend to be similar on each system) than of the cost of providing the illusion of one-sided communication under MPI 1.1.
Currently, our GASNet-MPI implementation works by having each host post a set of nonblocking receive calls in advance, with varying buffer sizes, so that a buffer of the correct size is generally immediately available for any incoming message. This allows a remote processor to complete a message send before the target processor may even be aware that the message has arrived. This buffering strategy requires that the message be copied twice (once on the source node, to add a header specifying the destination address, and once at the target node, to move the data from the buffer to the final destination).
The GASNet-MPI implementation could probably provide better performance for large messages if it used the initial receive buffer to inform the target CPU to set up a receive call that placed the data directly into the destination memory, thus avoiding buffering. However, this would still be inherently slower than the elan-based direct RDMA version: the transaction would incur an additional rendezvous (in the form of a set of MPI messages) to set up the direct receive call (and likely an additional network round trip as MPI did its own internal rendezvous), followed by the direct RMA transfer, and then a final MPI message from the destination processor informing the initiator that the message had completed (MPI does not provide built-in notification of remote completion, but it is required for GAS language semantics, which sometimes need to guarantee target completion for remote writes). The transaction would also need to wait initially for the remote processor to poll the network to see the initial setup message. For messages of a sufficiently large size, however, these costs ought to be amortized to the point where the MPI layer’s peak bandwidth performance might asymptotically approach that of the elan layer (when the remote node is well-attentive to the network). It is unclear what the crossover message size would be for this sort of algorithm, however, and it would need to be tuned for each different type of network (and possibly on a per-machine basis), thus lessening the portability benefit of using MPI.
The performance of the naïve CG benchmark (which uses an average message size of only 8 bytes) is less ambiguous. For applications which use small messages, elan provides a significant speedup, allowing the application to run more than four times faster. Clearly the communication performance obtained by directly targeting the native
network API is more suitable than a solution layered on MPI for supporting an incremental programming model for parallel applications and for inherently fine-grained applications.
It is unclear if MPI 1.1 will ever be able to provide GAS languages with the same performance as native network APIs for small and medium-sized one-sided messages. The popularity of MPI has meant that vendors are willing to expend considerable effort tuning their implementations, and certain trends are promising, such as the offloading of some of MPI’s message-passing logic onto dedicated network co-processors. Although such co-processors are not likely to improve end-to-end message latency, they might lower the host CPU’s software overheads to levels comparable to those of lower-level network APIs, which might in turn allow a UPC implementation to hide the latency by performing unrelated work in the interim.
3 MPI-RMA as a compilation target for GAS languages
The most recent version of the MPI specification (MPI 2.0) extends MPI 1.1 with a number of new features. In particular, Chapter 6 of the specification [24] adds support for “one-sided communication” (also called “Remote Memory Access” (RMA)). This extension supplements the traditional two-sided MPI communication model with a one-sided interface that can take advantage of the capabilities of RMA network hardware, in the hopes of lowering latency and software overhead costs for applications written using a shared-memory-like paradigm.
On the surface, MPI-RMA thus seems like a natural fit for GAS languages. However, the rest of this document explains why MPI-RMA unfortunately does not meet the requirements laid out in Section 1.1 for implementing global address space languages. Essentially, the strong restrictions placed on memory access patterns by the API and the weakness of its semantic guarantees make it unusable for GAS language implementation purposes. It is our sincere hope that the observations presented here regarding the problematic semantic aspects of the interface can serve as a guide during future revisions of the MPI-RMA specification – in Section 5 we provide some initial ideas for improving the semantics to better meet the needs of the GAS language community.
It is important to note that we are interested in writing portable code, and are therefore not concerned with the behavior of particular implementations of MPI-RMA (which may happen to relax some of these constraints, and/or have well-defined semantics for conditions the specification labels as erroneous), but rather with the guarantees provided by any MPI-RMA implementation (including one which aggressively exploits the intentionally under-specified aspects of the API).
3.1 Basics of the MPI-RMA API
The semantics of MPI-RMA are fairly complex. Here we present only an overview of its usage, then proceed to discuss those aspects of the API that affect GAS language implementations. The reader may consult the MPI 2.0 specification [24] for full details – the subsequent discussion includes references to the sections in the specification relevant to each major semantic issue.
The MPI-RMA API revolves around the use of abstract objects called “windows”, which intuitively specify regions of a process’s memory that have been made available for remote operations by other MPI processes. Windows are created using a collective operation (MPI_Win_create) called by all processes within a “communicator” (a group of processes in MPI terminology), which specifies a base address and length (which may be different on each process), and is permitted to span very large areas (e.g., the entire virtual address space). All three one-sided RMA operations (MPI_Put, MPI_Get, MPI_Accumulate) take a reference to such a window, an offset into the window, and a rank integer to indicate which process is the remote target. All one-sided operations are implicitly non-blocking and must be synchronized using one of the synchronization methods described below.
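The basic call shapes can be sketched as follows; this uses the standard MPI-2 API (here in its active-target fence form, described in the next section), with error handling and MPI_Init/MPI_Finalize context omitted:

```c
#include <mpi.h>

double buf[1024];   /* the memory this process exposes for RMA */
MPI_Win win;

void rma_example(int target_rank) {
    /* Collective window creation across MPI_COMM_WORLD. */
    MPI_Win_create(buf, sizeof buf, sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);               /* open a synchronization epoch */
    double x = 3.14;
    MPI_Put(&x, 1, MPI_DOUBLE, target_rank,
            0 /* displacement into target's window */,
            1, MPI_DOUBLE, win);         /* implicitly non-blocking */
    MPI_Win_fence(0, win);               /* the put is complete only here */

    MPI_Win_free(&win);
}
```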
3.1.1 Active Target vs. Passive Target
There are two primary “modes” in which the one-sided API can be used, named “active target” and “passive target”. The primary semantic distinction is whether or not cooperation is required from the remote node in order to complete a remote memory access. All RMA operations on a window must take place within a synchronization “epoch” (with a start and end point defined by explicit synchronization calls), and operations are not guaranteed to be complete until the end of such an epoch. The active and passive target modes differ in which process makes these synchronization calls.
Active target operation requires synchronization functions to be called on both the origin process (the one making the RMA get/put accesses) and the target process (the one hosting the memory in the referenced window). The origin process calls MPI_Win_start/MPI_Win_complete to begin/end the synchronization epoch, and the target process must cooperate by calling MPI_Win_post/MPI_Win_wait to acknowledge the beginning/end of the epoch (there is also a collective MPI_Win_fence operation which can be substituted for one or more of these calls). In any case, this required cooperation effectively destroys the possibility of implementing the truly one-sided operations that we wish to provide in GAS languages using active-target mode RMA.
Passive target operation provides more lenient synchronization. In passive-target operation, only the originating process calls synchronization functions (MPI_Win_lock/MPI_Win_unlock) to start/end the access epoch. As with active target, all RMA accesses must take place within such an epoch and are not guaranteed to complete until the MPI_Win_unlock call completes. There are two forms of MPI_Win_lock: shared and exclusive. MPI_Win_lock(exclusive) enforces mutual exclusion on the window and the RMA operations performed within the epoch – i.e., it conceptually blocks until it can start an exclusive access epoch to the window, and no other processes may enter a shared or exclusive access epoch for that window until the process with exclusive access unlocks (the semantics are actually slightly weaker than this, but the intuition is correct). MPI_Win_lock(shared) allows other concurrent shared epochs from other processes. The specification recommends the use of exclusive epochs when executing any local or RMA update operations on the memory encompassed by the window, to ensure well-defined semantics (section 6.4.3, p. 131).
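A minimal passive-target epoch looks like the following sketch (standard MPI-2 calls; the function name and the assumption that the window has already been created are ours):

```c
#include <mpi.h>

/* Only the origin makes synchronization calls; the put is not
   guaranteed to be complete until MPI_Win_unlock returns. */
void one_sided_put(MPI_Win win, int target_rank, double value) {
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0 /* assert */, win);
    MPI_Put(&value, 1, MPI_DOUBLE, target_rank,
            0 /* displacement in the target's window */,
            1, MPI_DOUBLE, win);
    MPI_Win_unlock(target_rank, win);   /* closes the access epoch */
}
```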
3.2 Restrictions on the Use of Passive Target RMA
The interface described thus far for passive-target RMA seems reasonable; unfortunately, however, there are a large number of restrictions on how it may legally be used. Here are some of the most important restrictions:
1. Window creation is a collective operation – all processes which intend to use a window for RMA (including all intended origin and target processes) must participate in the creation of that window (section 6.2.1, p. 110).
2. Implementors may restrict the use of passive-target RMA operations to only work on memory allocated using the “special” memory allocator MPI_Alloc_mem (section 6.4.3, p. 131). This prevents the use of passive-target RMA on static data and forces all globally-visible objects to be allocated using this “special” allocation call (no guarantees are made about how much memory can be allocated using this call, and some implementations may restrict it to a small number of pinnable pages).
3. It is erroneous to have concurrent conflicting RMA get/put (or local load/store) accesses to the same memory location (section 6.3, p. 113).
4. The memory spanned by a window may not concurrently be updated by a remote RMA operation and a local store operation (i.e., within a single access epoch), even if these two updates access different (i.e., non-overlapping) locations in the window (section 6.3, p. 113).
5. Multiple windows are permitted to include overlapping memory regions; however, it is erroneous to use concurrent operations on distinct overlapping windows (section 6.2.1, p. 111).
6. RMA operations on a given window are only permitted to access the memory of a single process during an access epoch (section 6.4.3, p. 131).
3.3 Implications of the MPI 2.0 semantics
We now investigate the implications of the above restrictions on our effort to implement remote accesses in a GAS language.
Restriction #1 implies that a GAS language implementation cannot use a separate window per shared object, because that would require a collective operation for the allocation of shared objects, and shared object allocation in GAS languages is often required to be a purely non-collective, local operation. In order to support non-collective allocation, many or all shared objects would need to be coalesced within a single window. But while arbitrarily large regions (like the entire virtual address space) can be mapped within a single window, this approach would severely limit concurrency due to restriction #4. The same shared object can be mapped into several windows, but due to restriction #5 this approach is probably not useful.
Restriction #2 implies that all potentially shared objects must be allocated using MPI_Alloc_mem(). This restriction alone may be enough to make MPI-RMA unsuitable for many GAS languages, unless MPI implementations permit large amounts of memory to be allocated using this function (recall that a significant fraction of all data allocated by GAS programs is typically accessed remotely at some point in program execution).
Restriction #6 means that an implementation would probably need at least one separate window for each target process, as otherwise RMA operations from one process to different target processes would be unnecessarily serialized. This may present scalability problems for applications using a large number of nodes, depending on how windows are implemented.
The underlying concept addressed by restriction #3 is fundamental to shared-memory programming and nothing new. GAS languages generally specify that conflicting accesses to a single memory location will store an undefined result to the location (for conflicting writes) or return an undefined value (for the read in a read-write conflict). However, the MPI restriction is unfortunately much stronger – it says such conflicting accesses are erroneous, which implies that any resulting behavior after such a violation is possible (e.g., the MPI implementation would be within its rights to consider this a fatal error). Unfortunately, it is not feasible to statically detect all such conflicting accesses (from different processes) in user-provided code without application-specific information (or perhaps even with it). Using a great deal of compiler analysis, a conservative superset of the conflicting accesses could be generated, but in a weakly typed language such as UPC, this is likely to include most of the accesses. It is impossible to detect such conflicts at runtime without global communication. In truth, the compiler analysis required simply to decide that it is safe for a single process to include any other RMA accesses within the same epoch as an RMA put or accumulate is non-trivial, although this problem is an issue of sequential alias analysis, which is considerably better understood. MPI_Win_lock(exclusive) can be used to conservatively prevent concurrent conflicting accesses from distinct processes (by wrapping every RMA put within its own exclusive epoch). However, because this locks the entire window and serializes all access epochs to that window, it would drastically reduce the concurrency of accesses to distinct (i.e., non-conflicting) memory locations that happen to reside within the same window (which would be very bad if the entire shared memory resided in a single window). This also effectively nullifies the ability to perform non-blocking puts. Another option which might help is to replace all RMA puts with RMA accumulate operations where the reduction operation is MPI_REPLACE – this has the same semantics as an RMA put, but conflicting accumulate operations have well-defined semantics (they behave as if the conflicting accumulates happened in some serial order). However, conflicting RMA gets and local loads to the same data would still be erroneous within the access epoch, so an untenable amount of conflict analysis would still be required.
Restriction #4 is particularly onerous for GAS language implementations. It prevents the local process from making any changes to memory that lies within a window during a remote access epoch to that window, even to different (i.e., non-conflicting) memory locations. This implies some form of synchronization between the origin and target process when accessing these locations, which implies that truly one-sided communication is not possible. The MPI spec recommends that the local process perform all updates to local memory which falls within a window inside an exclusive epoch on that window. Because every access to a language-level local pointer in UPC and Titanium is potentially an access to a shared memory location residing locally, in the absence of other information every local store operation would need to be wrapped within an exclusive access epoch (possibly prefixed with a check of whether the accessed location resides in a window). Similarly, every local load which could potentially conflict with a remote RMA put or accumulate to that location would need to be wrapped within a shared access epoch. The performance implications of adding such overheads to what should be simple local memory load/store instructions are staggering. Note that it is not sufficient for the local process to simply
maintain a permanent epoch on its own window, because this would prevent remote RMA operations on that window from making progress. Finally, restriction #4 also implies that the entire virtual address space of a process cannot be included in a window, because this would include private memory that is constantly changing, such as the program stack.
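To make the cost concrete, the following sketch shows what restriction #4 would, in the worst case, make of a single local store to memory that happens to fall inside a window (the function name and parameters are ours; the calls are standard MPI-2):

```c
#include <mpi.h>

/* Every local store to window-covered memory would need an exclusive
   epoch on this process's own portion of the window around it --
   turning one store instruction into two synchronization calls. */
void guarded_local_store(MPI_Win win, int my_rank, int *local_ptr, int v) {
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, my_rank, 0, win);
    *local_ptr = v;                /* the actual store */
    MPI_Win_unlock(my_rank, win);  /* serializes against remote RMA */
}
```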
The combination of these effects makes MPI-RMA effectively unusable by GAS language implementations: it is extremely unlikely that a layer could be written on top of MPI-RMA that would provide an efficient implementation of GAS language communication semantics. Given that adherence to some of the MPI-RMA restrictions involves such difficult compiler problems as conflict and alias analysis, it is likely that many other parallel languages would also find MPI-RMA a difficult interface to target. It seems likely that only languages that effectively expose most of MPI-RMA’s restrictions and programming disciplines to the user at the language level would find MPI-RMA a useful compilation target – however, users are unlikely to tolerate such rigid usability restrictions in any parallel language that exposes a shared-memory-like paradigm.
4 Related Work
A number of papers on MPI-RMA performance are available. Luecke and Hu [20] evaluated MPI-RMA performance on the Cray SV1, and Luecke et al. [21] compared MPI-RMA performance to that of the vendor-supplied SHMEM libraries on the Cray T3E and SGI Origin 2000 architectures, finding that the SHMEM interface significantly outperformed the MPI-RMA implementation available at the time. Traff et al. [30] examined the performance of MPI-RMA on an NEC SX-5 system, and found its data transfer performance to be similar to that of the MPI 1.0 two-sided API. Matthey and Hansen [22] report that a production molecular dynamics simulation code ran 10-70% faster on an Origin 2000 system after being converted from MPI-1 to MPI-RMA.
Our strategy of implementing one-sided, asynchronous messaging on top of MPI 1.1 is not a new one. Dobbelaere [9] reports implementing a custom one-sided communication layer over MPI 1, and Booth and Mourao [6] discuss implementing the MPI-RMA API itself over MPI 1.1.
It appears from Smith [29] that the MPI-RMA pioneers decided early on to require a separate set of allocation functions to allocate data that can be accessed by passive-target one-sided operations. For an alternative approach that allows the entire virtual memory space to be accessed by RDMA (even for the tricky case of pinning-based networks such as Myrinet), see the description of lazy registration/pinning methods in [3].
Gropp [13] reviews some of the reasons for MPI’s enduring success in the parallel computing community.
5 Conclusion
We have shown that despite having certain desirable characteristics, neither the MPI 1.1 interface nor the more recent MPI 2.0 RMA extensions make an attractive compilation target for Global Address Space languages. MPI 1.1, despite its wide availability and often highly-tuned performance, imposes too many overheads with its two-sided messaging paradigm for GAS languages, particularly those in which small-message performance is important. The newer MPI-RMA API imposes too many semantic restrictions to be a useful portable compilation target, at least for parallel languages which allow aliasing, data conflicts, and/or the illusion of a single, arbitrarily accessible shared address space.
The fact that MPI-RMA makes a poor compilation target for at least certain classes of parallel languages does not mean that the API is not useful to application programmers, who are capable of globally structuring their applications to abide by the specification’s semantic restrictions. The purpose of this paper, in short, is not to malign MPI-RMA, but rather to explain why it is not suitable as a portable communication layer for implementing parallel languages such as Titanium, UPC, and Co-Array Fortran.
We sincerely hope that the MPI-RMA interface can be revised in future versions of the MPI specification to relax some of the problematic semantic restrictions described in this paper. Perhaps with some careful changes made to accommodate the requirements of parallel high-performance language implementors, it would become more useful as a portable compilation target for these languages. One possible approach is to introduce a third mode of operation in MPI-RMA, complementing active-target and passive-target modes, that discards the current problematic semantic restrictions and provides a high-performance RMA interface similar to the one in GASNet [5] or ARMCI [26]. Our experience with these layers indicates such an interface could be implemented efficiently on a wide range of communication networks and would successfully expose the high-performance capabilities of the network hardware without crippling the client with unreasonable semantic restrictions. Any remaining portability concerns could easily be addressed by making the new interface an optional component that need not be provided by all MPI implementations. We believe the inclusion of such an option within a future revision of the MPI specification would largely solve the portability issues in implementing GAS languages and encourage communication innovation by allowing network vendors to more usefully expose the one-sided RMA capabilities of their hardware.
References
[1] AMMPI home page. http://www.cs.berkeley.edu/∼bonachea/ammpi.
[2] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS parallel benchmarks. Tech Report RNR-94-007, RNR, March 1994. http://www.nas.nasa.gov/Software/NPB/.

[3] C. Bell and D. Bonachea. A new DMA registration strategy for pinning-based high performance networks. In Workshop on Communication Architecture for Clusters (CAC03) of IPDPS '03, Nice, France, 2003.

[4] C. Bell, D. Bonachea, Y. Cote, J. Duell, P. Hargrove, P. Husbands, C. Iancu, M. Welcome, and K. Yelick. An evaluation of current high-performance networks. In IPDPS 2003, 2003. http://upc.lbl.gov.

[5] D. Bonachea. GASNet specification, v1.1. Tech Report UCB/CSD-02-1207, U.C. Berkeley, October 2002. http://www.cs.berkeley.edu/∼bonachea/gasnet.

[6] S. Booth and F. E. Mourao. Single sided MPI implementations for SUN MPI. In Supercomputing, 2000.

[7] W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, and K. Yelick. A performance analysis of the Berkeley UPC compiler. In 17th Annual International Conference on Supercomputing (ICS), 2003. http://upc.lbl.gov.

[8] Compaq UPC compiler. http://www.tru64unix.compaq.com/upc/.

[9] J. Dobbelaere and N. Chrisochoides. One-sided communication over MPI-1. http://citeseer.nj.nec.com/dobbelaere01onesided.html.

[10] T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper. UPC specification, v1.1, March 2003. http://upc.gwu.edu.

[11] Elan programmer's manual. http://www.quadrics.com.

[12] GM reference manual. http://www.myri.com/scs/GM/doc/refman.pdf.

[13] W. D. Gropp. Learning from the success of MPI. In International Conference on High Performance Computing (HiPC 2001), August 2001.

[14] P. Hilfinger, D. Bonachea, D. Gay, S. Graham, B. Liblit, G. Pike, and K. Yelick. Titanium language reference manual. Tech Report UCB/CSD-01-1163, U.C. Berkeley, November 2001.

[15] Infiniband trade association home page. http://www.infinibandta.org.

[16] A. Krishnamurthy and K. Yelick. Analyses and optimizations for shared address space programs. Journal of Parallel and Distributed Computing, 38(2):130–144, 1996.

[17] Using the LAPI. http://www.research.ibm.com/actc/OptLib/LAPI_Using.htm.

[18] B. Liblit and A. Aiken. Type systems for distributed data structures. In Conference Record of POPL '00: The 27th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Boston, Massachusetts, 19–21 Jan. 2000.

[19] B. Liblit, A. Aiken, and K. Yelick. Type systems for distributed data sharing. In SAS '03: The 10th International Static Analysis Symposium, Lecture Notes in Computer Science, San Diego, California, June 11–13, 2003. Springer-Verlag.
[20] G. R. Luecke and W. Hu. Evaluating the performance of MPI-2 one-sided routines on a Cray SV1. Technical report, December 21, 2002.

[21] G. R. Luecke, S. Spanoyannis, and M. Kraeva. The performance and scalability of SHMEM and MPI-2 one-sided routines on a SGI Origin 2000 and a Cray T3E-600. In PEMCS, December 2002.

[22] T. Matthey and J. Hansen. Evaluation of MPI's one-sided communication mechanism for short-range molecular dynamics on the Origin 2000. In Proceedings of PARA2000, the Fifth International Workshop on Applied Parallel Computing, 2000.

[23] S. P. Midkiff and D. A. Padua. Issues in the optimization of parallel programs. In Proceedings of the 1990 International Conference on Parallel Processing, 1990.

[24] MPI Forum. MPI-2: a message-passing interface standard. International Journal of High Performance Computing Applications, 12:1–299, 1998. http://www.mpi-forum.org/docs/mpi-20.ps.

[25] MPI Forum. MPI: A message-passing interface standard, v1.1. Technical report, University of Tennessee, Knoxville, June 12, 1995. http://www.mpi-forum.org/docs/mpi-11.ps.

[26] J. Nieplocha and B. Carpenter. ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. In Proc. RTSPP IPPS/SPDP '99, 1999.

[27] R. Numrich and J. Reid. Co-Array Fortran for parallel programming. ACM Fortran Forum, 17(2):1–31, 1998.

[28] D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Transactions on Programming Languages and Systems, 10(2):282–312, April 1988.

[29] A. G. Smith. Using MPI 2 one sided communications on Cray T3D. Tech report, EPCC, The University of Edinburgh, December 1995. http://citeseer.nj.nec.com/88176.html.

[30] J. Traff, H. Ritzdorf, and R. Hempel. The implementation of MPI-2 one-sided communication for the NEC SX. In Proceedings of Supercomputing, 2000.

[31] K. Yelick, L. Semenzato, G. Pike, et al. Titanium: A high-performance Java dialect. In ACM 1998 Workshop on Java for High-Performance Network Computing, February 1998. http://titanium.cs.berkeley.edu.
Fast Parallel Solution For Set-Packing and Clique Problems by DNA-based Computing
Michael (Shan-Hui) Ho 1, Weng-Long Chang2, Minyi Guo3,
and Laurence Tianruo Yang4
1Department of Information Management Southern Taiwan University of Technology, Tainan County, Taiwan 710, R.O.C.
Phone Number: +886-6-2533131 Ext. 4300 E-mail: [email protected], [email protected]
Fax Number: +886−6−2541621
2Department of Information Management Southern Taiwan University of Technology, Tainan County, Taiwan 710, R.O.C.
Phone Number: +886-6-2533131 Ext. 4300 E-mail: [email protected], [email protected]
Fax Number: +886−6−2541621
Contact Author: Minyi Guo 3Department of Computer Software,
The University of Aizu, Aizu-Wakamatsu City, Fukushima 965-8580, Japan Phone Number: 0242-37-2557
E-mail: [email protected] Fax Number: 0242-37-2744
4Department of Computer Science,
St. Francis Xavier University P.O. Box 5000, Antigonish, B2G 2W5, NS, Canada
E-mail: [email protected]
Abstract
This paper shows how to use stickers to construct the DNA solution space for the library sequences in the set-packing problem and the clique problem. Then, with biological operations, we propose DNA-based algorithms to remove illegal solutions and to find legal solutions for the set-packing and clique problems from the sticker solution space. Any NP-complete problem in Cook's Theorem [28] can be reduced to and solved by the proposed DNA-based computing approach if its size is equal to or less than that of the set-packing problem. Otherwise, Cook's Theorem is incorrect on DNA-based computing and a new DNA algorithm should be developed from the characteristics of the NP-complete problem. Finally, the result of the DNA simulation is given.

Key Words: Molecular Parallel Computing, DNA-based Computing, NP-complete Problem, Cook's Theorem.
1. Introduction
Through advances in molecular biology [1], it is now possible to produce roughly 10^18 DNA strands that fit in a test tube. Those 10^18 DNA strands can also be applied to represent 10^18 bits of information, and basic biological operations can be used to operate on those 10^18 bits simultaneously. In other words, 10^18 data processors can be executed in parallel. Hence, it becomes obvious that biological computing can provide huge parallelism for dealing with problems in the real world.
Adleman wrote the first paper demonstrating that DNA (deoxyribonucleic acid) strands could be applied to figure out solutions of an instance of the NP-complete Hamiltonian path problem [2]. Lipton showed that the Adleman techniques could also be used to solve the NP-complete satisfiability problem [3]. Adleman and his co-authors proposed the sticker model to enhance the Adleman-Lipton model [14].
In this paper, we use stickers to construct a DNA solution space for the library sequences of the set-packing and clique problems. Simultaneously, we also apply DNA operations in the Adleman-Lipton model to develop DNA algorithms. The results of the proposed DNA algorithms show that the set-packing and clique problems are resolved with biological operations in the Adleman-Lipton model on the sticker solution space. Furthermore, if the size of a reduced NP-complete problem is equal to or less than that of the set-packing problem, then the proposed algorithm for the set-packing problem can be used directly to solve the reduced NP-complete problem.
2. DNA Model of Computation
2.1. The Adleman-Lipton Model
It is cited from [1, 16] that DNA (deoxyribonucleic acid) is a polymer strung together from monomers called deoxyribonucleotides. Distinct nucleotides are distinguished by their bases: adenine, guanine, cytosine and thymine, abbreviated as A, G, C and T. It is pointed out in [1, 11, 16] that a DNA strand is essentially a sequence (polymer) of the four types of nucleotides, identified by the bases they contain. Under appropriate conditions two strands of DNA can form a double strand if the respective bases are the Watson-Crick complements of each other – A matches T and C matches G; also the 3' end matches the 5' end. The length of a single-stranded DNA is the number of nucleotides comprising the strand. Thus, if a single-stranded DNA includes 20 nucleotides, we say that it is a 20-mer (a polymer containing 20 monomers). The length of a double-stranded DNA (where each nucleotide is base-paired) is counted in the number of base pairs. Thus if we make a double-stranded DNA from a single-stranded 20-mer, then the length of the double-stranded DNA is 20 base pairs, also written as 20 bp. (For more discussion of the relevant biological background refer to [1, 11, 16].)
In the Adleman-Lipton model [2, 3], splints were used to construct the edges of paths in a particular graph, which represented all possible binary numbers. As it stands, their construction indiscriminately builds all splints that lead to a complete graph, which means hybridization has a higher probability of errors. Hence, Adleman and his co-authors [14] proposed the sticker-based model, an abstract model of molecular computing based on DNA with a random-access memory as well as a new form of encoding the information.
The DNA operations in the Adleman-Lipton model [2, 3, 11, 12] are described below. These operations will be used for calculating solutions of the set-packing and clique problems.
The Adleman-Lipton Model:
A (test) tube is a set of molecules of DNA (i.e. a multi-set of finite strings over the alphabet {A, C, G, T}). Given a tube, one can perform the following operations:
1. Extract. Given a tube P and a short single strand of DNA called S, we can produce two tubes +(P, S) and −(P, S), where +(P, S) is all of the molecules of DNA in P which contain the short strand S, and −(P, S) is all of the molecules of DNA in P which do not contain the short strand S.
2. Merge. Given tubes P1 and P2, yield ∪(P1, P2), where ∪(P1, P2) = P1 ∪ P2.
This operation is to pour two tubes into one, with no change in the individual strands.
3. Detect. Given a tube P, if it includes at least one DNA molecule we can say
‘yes’, and if it contains no DNA we can say ‘no’.
4. Append. Given a tube P and a short strand of DNA called Z, the operation will append the short strand Z onto the end of every strand in tube P.
5. Discard. Given a tube P, the operation will discard the tube P.
6. Read. Given a tube P, the operation is used to describe a single molecule,
which is contained in tube P. Even if P contains many different molecules, the operation can give an explicit description of only one of them.
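The six operations above can be mirrored as multiset manipulations in ordinary software. The following minimal Python model is our own sketch, not part of the Adleman-Lipton formalism: a tube is a list of strings and a "short strand" is a substring.

```python
class Tube:
    """A test tube modeled as a multiset (list) of DNA strings."""

    def __init__(self, strands=None):
        self.strands = list(strands or [])

    def extract(self, s):
        """Extract: split into +(P, s) and -(P, s) by presence of substring s."""
        plus = Tube([x for x in self.strands if s in x])
        minus = Tube([x for x in self.strands if s not in x])
        return plus, minus

    def merge(self, other):
        """Merge: pour two tubes into one, with no change to individual strands."""
        return Tube(self.strands + other.strands)

    def detect(self):
        """Detect: True ('yes') iff the tube contains at least one molecule."""
        return len(self.strands) > 0

    def append(self, z):
        """Append: attach the short strand z onto the end of every strand."""
        self.strands = [x + z for x in self.strands]

    def read(self):
        """Read: describe one molecule contained in the tube (if any)."""
        return self.strands[0] if self.strands else None
```

For instance, extracting on the substring "AT" from a tube holding "GATC" and "GGCC" yields a plus-tube with the first strand and a minus-tube with the second.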
2.2. Other Related Work and Comparison with the Adleman-Lipton Model
Based on the splint solution space in the Adleman-Lipton model, the methods of [7, 17, 18, 19, 20] could be applied toward solving the traveling salesman problem, the dominating-set problem, the vertex cover problem, the clique problem, the independent-set problem, the 3-dimensional matching problem and the set-packing problem. The methods used for resolving these problems require exponentially increasing volumes of DNA and linearly increasing time.
Bach et al. [33] proposed an n·1.89^n volume, O(n² + m²) time molecular algorithm for the 3-coloring problem and a 1.51^n volume, O(n²m²) time molecular algorithm for the independent set problem, where n and m are, respectively, the number of vertices and the number of edges in the problems resolved. Fu [21] presented a polynomial-time algorithm with a 1.497^n volume for the 3-SAT problem, a polynomial-time algorithm with a 1.345^n volume for the 3-coloring problem and a polynomial-time algorithm with a 1.229^n volume for the independent set problem. Though those volumes [21, 33] are smaller, constructing them is more difficult and the time complexity of the methods is higher.
Ouyang et al. [4] showed that enzymes could be used to solve the NP-complete clique problem. Because the maximum number of vertices that they can process is limited to 27, the maximum number of DNA strands for solving the problem is 2^27 [4]. Shin et al. [8] presented an encoding scheme for decreasing the error rate of hybridization. This method [8] can be employed toward the traveling salesman problem, representing integers and real values with fixed-length codes. Arito et al. [5] and Morimoto et al. [6] proposed new molecular experimental techniques and a solid-phase method to find a Hamiltonian path. Amos [13] proposed a parallel filtering model for resolving the Hamiltonian path problem, the subgraph isomorphism problem, the 3-vertex-colorability problem, the clique problem and the independent-set problem. These methods [5, 6, 13] have lowered the error rate in real molecular experiments.
In the literature [26, 27, 30], the methods for DNA-based computing by self-assembly require DNA nanostructures, called tiles, to have efficient chemistries, expressive computational power, and convenient input and output (I/O) mechanisms. DNA tiles have a lower error rate in self-assembly. Garzon et al. [31] provided a review of the most important advances in molecular computing.
Adleman and his co-authors [14] proposed a sticker-based model to reduce the error rate of hybridization in the Adleman-Lipton model. Their model can be used for determining solutions of an instance of the set cover problem. Perez-Jimenez et al. [15] employed the sticker-based model [14] to resolve knapsack problems. In our previous work, Chang et al. [25, 32] also employed the sticker-based model and the Adleman-Lipton model for dealing with the dominating-set problem and the set-splitting problem, decreasing the error rate of hybridization.
3. Using Sticker for Solving the Set-Packing Problem in the Adleman-Lipton Model
3.1. Definition of the Set-Packing Problem
Assume that a finite set S is {s1, …, sq}, where sm is the mth element for 1 ≤ m ≤ q. Also suppose that |S| is the number of elements in S and |S| is equal to q. Assume that C is a collection of subsets of S and C is equal to {C1, …, Cn}, where each Ck is a subset of S for 1 ≤ k ≤ n. Also suppose that |C| is the number of subsets in C and |C| is equal to n. Suppose that the number of subsets in C is greater than or equal to l, where l is a positive number. A set-packing for S is a sub-collection C1 ⊆ C such that C1 contains at least l mutually disjoint sets [9, 10]. The set-packing problem is to find the maximum-size set-packing for S and has been proved to be NP-complete [10].
In Figure 1, a finite set S is {1, 2} and a collection C is {{1}, {2}} for S. The set S and the collection C denote such a problem. The maximum-size set-packing for S is {{1}, {2}}. Hence, the size of the set-packing problem in Figure 1 is two. It is indicated in [10] that finding a maximum-size set-packing is an NP-complete problem, so it can be formulated as a "search" problem.
S = {1, 2} and C = {{1}, {2}}
Figure 1: The finite set and the collection of our problem.
3.2. Using Sticker for Constructing Solution Space of DNA Sequence for the Set-Packing Problem
In the Adleman-Lipton model, the main idea is to first generate the solution space of DNA sequences for the problem to be resolved. Then, basic biological operations are used to remove illegal solutions and to find legal solutions from the solution space. Therefore, the first step in resolving the set-packing problem is to produce a test tube which contains all of the possible set-packings. Assume that an n-digit binary number corresponds to each possible set-packing for any n-subset collection C. Also suppose that C1 is a set-packing for C. If the ith bit in an n-digit binary number is set to 1, then the ith subset is in C1. If the ith bit in an n-digit binary number is set to 0, then the corresponding subset is not in C1.
Assume that q one-digit binary numbers represent the q elements in S, and suppose that the mth one-digit binary number corresponds to the mth element in S. If the kth bit in the n bits is set to 1 and the mth element in Ck is distinct from other elements in other subsets, then the corresponding subset is in C1 and the mth element in Ck is included in C1. In that case the mth one-digit binary number is appended onto the tail of those binary numbers containing the value 1 in the kth bit. If the kth bit in the n bits is set to 0, then every element in Ck is excluded from C1; that is to say, the elements in Ck are not appended onto the tail of those binary numbers containing the value 0 in the kth bit.
To implement this, assume that an n-bit binary number Z is represented by z1, …, zn, where the value of zk is 1 or 0 for 1 ≤ k ≤ n. The bit zk is the kth bit in Z and represents the kth subset in C. Assume that the q one-digit binary numbers are, respectively, y1, …, yq, and that ym represents the mth element in S. An n-bit binary number Z covers all possible 2^n choices of subsets, and each choice of subsets corresponds to a possible set-packing. If the value of zk is set to 1 and every element in the subset Ck is distinct from other elements in other subsets, then the value 1 for every element in Ck is, in turn, appended onto the tail of those binary numbers including the value 1 in the kth bit.
Consider the finite set S and the collection C in Figure 1. The finite set S is {1, 2} and the collection C is {{1}, {2}}. Table 1 denotes the solution space for Figure 1. The binary number 00 in Table 1 represents ∅. The binary numbers 01 and 10 in Table 1 represent {{2}} and {{1}}, respectively. The binary number 11 in Table 1 represents {{1}, {2}}. Though there are four 2-digit binary numbers for representing every possible set-packing, not every 2-digit binary number corresponds to a legal set-packing. Hence, in the next subsection, basic biological operations are used to develop an algorithm for removing the illegal solutions and finding the legal solutions.
2-digit binary number    The corresponding set-packing
00                       ∅
01                       {{2}}
10                       {{1}}
11                       {{1}, {2}}

Table 1: The solution space for Figure 1.
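The correspondence in Table 1 is plain binary enumeration: each n-digit binary number selects a sub-collection of C. A short Python sketch (the function name is ours) that generates this solution space for an arbitrary collection:

```python
def solution_space(C):
    """Enumerate all 2^n choices: each n-bit number selects a sub-collection of C."""
    n = len(C)
    space = {}
    for z in range(2 ** n):
        bits = format(z, f"0{n}b")  # the n-digit binary number z1...zn
        # bit k set to 1 means the kth subset is chosen
        chosen = [C[k] for k in range(n) if bits[k] == "1"]
        space[bits] = chosen
    return space

# The instance of Figure 1: S = {1, 2}, C = {{1}, {2}}
space = solution_space([{1}, {2}])
```

On the Figure 1 instance this reproduces the four rows of Table 1, including the illegal-versus-legal distinction left to the biological filtering steps.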
To represent all possible set-packings for the set-packing problem, stickers [14, 22] are used to construct the solution space for the problem solved. For every bit zk representing the kth subset in C, two distinct 15-base value sequences were designed: one represents the value "0" of zk and the other represents the value "1" of zk. For the sake of convenience in our presentation, assume that zk^1 denotes the value of zk to be 1 and zk^0 denotes the value of zk to be 0. Similarly, for every bit ym representing the mth element in S, two distinct 15-base value sequences are also designed: one represents the value 1 for ym and the other represents the value 0 for ym. Also assume that ym^1 denotes the value of ym to be 1 and ym^0 denotes the value of ym to be 0. Each of the 2^n possible set-packings was represented by a library sequence of 15·(n + q) bases consisting of the concatenation of one value sequence for each bit. DNA molecules with library sequences are termed library strands, and a combinatorial pool containing library strands is termed a library. The probes used for separating the library strands have sequences complementary to the value sequences.
It is pointed out in [14, 22] that errors in the separation of the library strands are errors in the computation. Sequences must be designed to ensure that library strands have little secondary structure that might inhibit intended probe-library hybridization. The design must also exclude sequences that might encourage unintended probe-library hybridization. To help achieve these goals, sequences were computer-generated to satisfy the following constraints [22].
1. Library sequences contain only A's, T's, and C's.
2. All library and probe sequences have no occurrence of 5 or more consecutive identical nucleotides.
3. Every probe sequence has at least 4 mismatches with all 15-base alignments of any library sequence (except for its matching value sequence).
4. Every 15-base subsequence of a library sequence has at least 4 mismatches with all 15-base alignments of itself or any other library sequence.
5. No probe sequence has a run of more than 7 matches with any 8-base alignment of any library sequence (except for its matching value sequence).
6. No library sequence has a run of more than 7 matches with any 8-base alignment of itself or any other library sequence.
7. Every probe sequence has 4, 5, or 6 G's in its sequence.

Constraint (1) is motivated by the assumption that library strands composed only of A's, T's, and C's will have less secondary structure than those composed of A's, T's, C's, and G's [23]. Constraint (2) is motivated by two assumptions: first, long homopolymer tracts may have unusual secondary structure, and second, the melting temperatures of probe-library hybrids will be more uniform if none of the probe-library hybrids involve long homopolymer tracts. Constraints (3) and (5) are intended to ensure that probes bind only weakly where they are not intended to bind. Constraints (4) and (6) are intended to ensure that library strands have a low affinity for themselves. Constraint (7) is intended to ensure that the intended probe-library pairings have uniform melting temperatures.
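Constraints (1), (2) and (7) are purely local to one sequence and easy to check mechanically. The following Python sketch is our own illustration of such a check (the function name is ours, not from [22]); the pairwise mismatch constraints (3)-(6) would additionally require comparing alignments between sequences.

```python
def check_local_constraints(seq, is_probe=False):
    """Check constraints (1), (2) and (7) of Section 3.2 on one sequence."""
    # Constraint (1): library sequences contain only A's, T's and C's.
    if not is_probe and any(base not in "ATC" for base in seq):
        return False
    # Constraint (2): no occurrence of 5 or more consecutive identical nucleotides.
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run >= 5:
            return False
    # Constraint (7): every probe sequence has 4, 5 or 6 G's.
    if is_probe and seq.count("G") not in (4, 5, 6):
        return False
    return True
```

For example, the value sequence AATTCACAAACAATT from the next paragraph passes the library-side checks (only A/T/C, longest homopolymer run of 3), while a sequence containing AAAAA would be rejected by constraint (2).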
The Adleman program [22] was modified to generate DNA sequences satisfying the constraints above. For example, for the two subsets in C of Figure 1, the DNA sequences generated were: z1^0 = AATTCACAAACAATT, z2^0 = ACTCCTTCCCTACTC, z1^1 = TCTCCCTATTTATTT and z2^1 = TCACCAAACCTAAAA. Similarly, for representing every element of S in Figure 1, the DNA sequences generated were: y1^0 = TCTCTCTCTAATCAT, y2^0 = ACTCACATACACCAC, y1^1 = CCATCATCTACCTTA and y2^1 = CAACCTATTATCTTA. For every possible set-packing in Figure 1, the corresponding library strand was synthesized by employing a mix-and-split combinatorial synthesis technique [24]. Similarly, for any n-subset collection, all of the library strands for representing every possible set-packing could also be synthesized with the same technique.

3.3. The DNA Algorithm for Solving the Set-Packing Problem
Algorithm 1: Solving the set-packing problem.
(1) Input T0, where tube T0 encodes all possible 2^n choices of subsets in a collection C = {C1, …, Cn} and every element of each subset in the collection C comes from a finite set S = {s1, …, sq}.
(2) For k = 1 to n, where n is the number of subsets in C.
    (2a) TON = +(T0, zk^1) and TOFF = −(T0, zk^1).
    (2b) For m = 1 to |Ck|, where |Ck| is the number of elements in the kth subset in C. Assume that the mth element in Ck is sm and ym is used to represent it.
        (2c) TBAD = +(TON, ym^1) and TON = −(TON, ym^1).
        (2d) Discard(TBAD).
        (2e) Append(TON, ym^1).
    End For
    (2f) T0 = ∪(TON, TOFF).
End For
(3) For k = 0 to n − 1
    For j = k down to 0
        (3a) Tj+1^ON = +(Tj, zk+1^1) and Tj = −(Tj, zk+1^1).
        (3b) Tj+1 = ∪(Tj+1, Tj+1^ON).
    End For
End For
(4) For k = n down to 1
    (4a) If (detect(Tk) = 'yes') then
    (4b) Read(Tk) and terminate the algorithm.
    End If
End For
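Algorithm 1 can be dry-run in ordinary software by modeling each tube as a collection of strands. The following Python sketch is our own simulation, not a molecular implementation: a strand is a pair (indices of chosen subsets, set of already-appended element values), and the three phases mirror steps (2), (3) and (4).

```python
def solve_set_packing(S, C):
    """Simulate Algorithm 1; tubes hold (chosen-subset-indices, used-elements) pairs."""
    n = len(C)
    # Step (1): T0 encodes all 2^n choices of subsets (bit k of z <-> subset k).
    T0 = [(tuple(k for k in range(n) if (z >> k) & 1), frozenset())
          for z in range(2 ** n)]
    # Step (2): remove choices whose chosen subsets share an element.
    for k in range(n):
        T_on = [t for t in T0 if k in t[0]]           # (2a) extract on zk = 1
        T_off = [t for t in T0 if k not in t[0]]
        for s in C[k]:                                # (2b)-(2e) per element of Ck
            T_on = [t for t in T_on if s not in t[1]]      # (2c)-(2d) discard TBAD
            T_on = [(c, u | {s}) for (c, u) in T_on]       # (2e) append ym^1
        T0 = T_on + T_off                             # (2f) merge
    # Step (3): distribute legal strands into tubes T1..Tn by number of subsets.
    tubes = {j: [] for j in range(n + 1)}
    for t in T0:
        tubes[len(t[0])].append(t)
    # Step (4): detect the largest non-empty tube and read one strand from it.
    for k in range(n, 0, -1):
        if tubes[k]:
            return tubes[k][0][0]   # indices of a maximum-size set-packing
    return ()
```

On the Figure 1 instance, `solve_set_packing({1, 2}, [{1}, {2}])` returns the index pair for {C1, C2}, matching the worked example below.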
Theorem 3-1: From those steps in Algorithm 1, the set-packing problem for a q-element set S and an n-subset collection C can be resolved. Proof:
A test tube of DNA strands that represent all possible 2^n input bit sequences z1, …, zn is yielded in Step 1. It is obvious that the test tube contains all possible 2^n choices of set-packing.

According to the definition of set-packing [10], Step 2 is executed at most n·q times to check which subsets are disjoint. The first execution of Step 2(a) uses the "extraction" operation to form two test tubes: TON and TOFF. The first tube TON includes all of the strands that have zk = 1; that is to say, the first subset in C occurs in tube TON. The second tube TOFF consists of all of the strands that have zk = 0; this is to say that the first subset in C does not appear in tube TOFF. Step 2(b) is an inner loop used to examine whether the elements in the kth subset are different from other elements in other subsets. The first execution of Step 2(c) uses the "extraction" operation to form two test tubes: TBAD and TON. The first tube TBAD includes all of the strands that have ym = 1; that is to say, the first element of the first subset in C has already appeared in tube TBAD. The second tube TON contains all of the strands that have ym = 0; this means that the first element of the first subset in C does not appear in tube TON. From the definition of set-packing [10], tube TBAD includes illegal strands. Hence, Step 2(d) applies the "discard" operation to discard tube TBAD. Step 2(e) employs the "append" operation to append the 15-base DNA sequence representing the value 1 of ym onto the tail of every strand in TON. Steps 2(c) to 2(e) are repeated until every element in the first subset in C is processed. Then, Step 2(f) uses the "merge" operation to pour the two tubes TON and TOFF into tube T0. That is to say, tube T0 now contains the strands representing every element in the first subset of C. Similarly, after the other subsets are processed, tube T0 consists of the strands representing the legal set-packings.
Each time the outer loop in Step 3 is executed, the inner loop is executed (k + 1) times. On the first execution of the outer loop, the inner loop is executed only once, so Steps 3(a) and 3(b) are each executed once. Step 3(a) uses the "extraction" operation to form two test tubes: T1^ON and T0. The first tube T1^ON contains all of the strands that have z1 = 1, and the second tube T0 consists of all of the strands that have z1 = 0. That is to say, the first tube encodes every set-packing containing the first subset in C and the second tube represents every set-packing without the first subset in C. Hence, Step 3(b) applies the "merge" operation to pour tube T1^ON into tube T1. After repeated execution of Steps 3(a) and 3(b), n new tubes are finally produced. Tube Tk for n ≥ k ≥ 1 then encodes the DNA strands that contain k subsets.
Because the set-packing problem is to find a maximum-size set-packing, tube Tn is detected first using the "detection" operation in Step 4(a). If it returns "yes", then tube Tn contains at least one maximum-size set-packing. Therefore, Step 4(b) uses the "read" operation to describe the sequence of a molecule in tube Tn and terminates the algorithm. Otherwise, Step 4(a) is repeated until a maximum-size set-packing is detected in some tube.
The set S and the collection C in Figure 1 can be applied to show the power of Algorithm 1. In Figure 1, the finite set S is {1, 2} and the collection C is {{1}, {2}}. Assume that the collection C is {C1, C2}, where C1 and C2 are {1} and {2}, respectively. From Step 1 in Algorithm 1, tube T0 is filled with four library strands using the techniques mentioned in subsection 3.2, representing the four possible set-packings for S and C. On the first execution of Step 2(a) in Algorithm 1, two tubes are yielded. The first tube, TON, includes the numbers 1* (* can be either 1 or 0), and the second tube, TOFF, contains the numbers 0*. That is to say, the first tube, TON, includes {C1} and {C1, C2}, and the second tube, TOFF, contains ∅ and {C2}. Then, the first execution of Step 2(c) uses the "extraction" operation to form two test tubes: TBAD and TON; this is to say that the first element, 1, in C1 has appeared in TBAD and does not appear in TON. Step 2(d) applies the "discard" operation to discard tube TBAD. After Step 2(e) is executed, tube TON contains the bit strings 10 1(y1^1) and 11 1(y1^1). Next, after Step 2(f) is executed, tube T0 includes the bit strings 00, 01, 10 1(y1^1) and 11 1(y1^1). Similarly, after the second subset C2 is processed, tube T0 consists of the bit strings 00, 10 1(y1^1), 01 1(y2^1) and 11 1(y1^1) 1(y2^1).

The first execution of Step 3(a) in Algorithm 1 applies the "extraction" operation to produce two tubes, T1^ON and T0. Tube T1^ON contains the bit strings 10 1(y1^1) and 11 1(y1^1) 1(y2^1), while tube T0 includes the bit strings 00 and 01 1(y2^1). Step 3(b) applies the "merge" operation to pour tube T1^ON into tube T1. Tube T1 then includes the bit strings 10 1(y1^1) and 11 1(y1^1) 1(y2^1), which represent the two set-packings {C1} and {C1, C2}. After repeated execution of Steps 3(a) and 3(b), two new tubes are produced. The new tube Tk for 2 ≥ k ≥ 1 encodes the legal set-packings that contain k subsets: tube T2 contains the bit string 11 1(y1^1) 1(y2^1), and tube T1 includes the bit strings 10 1(y1^1) and 01 1(y2^1).

The set-packing problem is to find a maximum-size set-packing. On the first execution of Step 4(a), tube T2 is detected first using the "detection" operation. This
step returns "yes" for the "detection" operation on tube T2. Therefore, Step 4(b) applies the "read" operation to describe the sequence of a molecule in tube T2 and terminates the algorithm. The maximum-size set-packing is found to be {C1, C2}.
3.4. The Complexity of Algorithm 1
Theorem 3-2: Suppose that a finite set S is {s1, …, sq} and a collection C is {C1, …, Cn}, where Ck is a subset of elements from S for 1 ≤ k ≤ n. The set-packing problem for S and C can be solved with O(q·n + n²) biological operations in the Adleman-Lipton model.
Proof:
Algorithm 1 can be applied to solve the set-packing problem for S and C. Algorithm 1 includes three main steps. Step 2 is mainly used to remove illegal library strands from all of the 2^n possible library strands. From Algorithm 1, it is obvious that Steps 2(a) through 2(f) use (n + q·n) "extraction" operations, (q·n) "discard" operations, (q·n) "append" operations and n "merge" operations. Step 3 is mainly used to figure out the number of subsets in every legal set-packing. It is indicated from Algorithm 1 that Step 3(a) takes (n·(n+1)/2) "extraction" operations and Step 3(b) takes (n·(n+1)/2) "merge" operations. Step 4 is employed to find a maximum-size set-packing among the legal set-packings. It is pointed out from Algorithm 1 that Step 4(a) takes at most n "detection" operations and Step 4(b) takes one "read" operation. Hence, from the statements above, it is inferred that the time complexity of Algorithm 1 is O(q·n + n²) biological operations in the Adleman-Lipton model.

Theorem 3-3: Suppose that a finite set S is {s1, …, sq} and a collection C is {C1, …, Cn}, where Ck is a subset of elements from S for 1 ≤ k ≤ n. The set-packing problem for S and C can be solved with O(2^n) library strands in the Adleman-Lipton model.
Proof: Refer to Theorem 3-2.

Theorem 3-4: Suppose that a finite set S is {s1, …, sq} and a collection C is {C1, …, Cn}, where Ck is a subset of elements from S for 1 ≤ k ≤ n. The set-packing problem for S and C can be solved with O(n) tubes in the Adleman-Lipton model.
Proof: Refer to Theorem 3-2.
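The operation counts in the proof of Theorem 3-2 can be tallied symbolically. The following small Python sketch is our own bookkeeping of those totals for given n and q, taking every subset to have q elements (the worst case the bounds allow); the function name is ours.

```python
def operation_counts(n, q):
    """Worst-case biological operation counts for Algorithm 1 (each |Ck| = q)."""
    return {
        "extract": (n + q * n) + n * (n + 1) // 2,  # Steps 2(a), 2(c) and 3(a)
        "discard": q * n,                           # Step 2(d)
        "append": q * n,                            # Step 2(e)
        "merge": n + n * (n + 1) // 2,              # Steps 2(f) and 3(b)
        "detect": n,                                # Step 4(a), at most
        "read": 1,                                  # Step 4(b)
    }

counts = operation_counts(n=3, q=2)
```

Summing the entries grows as q·n plus n², consistent with the O(q·n + n²) bound of Theorem 3-2.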
Theorem 3-5: Suppose that a finite set S is {s1, …, sq} and a collection C is {C1, …, Cn}, where Ck is a subset of elements from S for 1 ≤ k ≤ n. The set-packing problem for S and C can be solved with a longest library strand of O(15·n + 15·q) bases in the Adleman-Lipton model.
Proof: Refer to Theorem 3-2.

3.5. Range of Application of Cook's Theorem on DNA-Based Computing
Cook's Theorem [28] states that if an algorithm for one NP-complete problem is developed, then other NP-complete problems can be solved by means of reduction to that problem. Cook's Theorem has been demonstrated to be correct on general digital electronic computers. Let us assume a graph G = (V, E), where V is the set of vertices and E is the set of edges, and let |V| and |E| be the number of vertices and the number of edges. Mathematically, a clique for a graph G = (V, E) is a subset V1 ⊆ V of vertices such that every two vertices in V1 are joined by an edge in E [9, 10, 29]. The clique problem is to find a maximum-size clique in G. The simple structure of the clique problem makes it one of the most widely used problems for deriving other NP-completeness results [9, 28, 29]. The following theorem describes the range of application of Cook's Theorem on DNA-based computing.

Theorem 3-6: Assume that any other NP-complete problem can be reduced to the set-packing problem with a polynomial time algorithm on a general electronic computer. If the size of a reduced NP-complete problem is less than or equal to that of the set-packing problem, then Cook's Theorem is correct on DNA-based computing.
Proof: We transform the clique problem to the set-packing problem with a polynomial time algorithm [29]. Assume a graph G = (V, E), where V is {v1, v2, …, vn} and E is {(vp, vq) | (vp, vq) is an edge in G}, and assume that |E| is at most (n·(n − 1))/2. Suppose that a graph G is any instance of the clique problem. We construct a finite set S and a collection C, where S = {s1, …, s|E|} and C = {C1, …, C|V|}, such that S and C have a maximum-size set-packing if and only if G has a maximum-size clique.
From [29], each vertex vk in V corresponds to a subset Ck in C for 1 ≤ k ≤ n. Each edge in E also corresponds to an element sm in S for 1 ≤ m ≤ |E|. Each subset Ck in C for 1 ≤ k ≤ n is {sm | sm represents an edge containing the vertex vk, 1 ≤ m ≤ |E|}.
Setting S and C from G completes the construction of our instance of the set-packing problem. Therefore, the numbers of elements in S and C are |E| and n, respectively. Algorithm 1 is used to solve the set-packing problem for S and C with 2ⁿ DNA strands, O(|E|·n + n²) biological operations, O(n) tubes and the longest library strand, O(15·n + 15·|E|). That is to say, Algorithm 1 can be applied to solve the clique problem by means of reducing it to the set-packing problem. Hence, it is derived that if the size of a reduced NP-complete problem is less than or equal to that of the set-packing problem, then Cook's Theorem is correct on DNA-based computing.
From Theorem 3-6, if the size of a reduced NP-complete problem is equal to or less than that of the set-packing problem, then Algorithm 1 can be directly used for solving the reduced NP-complete problem. Otherwise, a new DNA algorithm should be developed from the characteristics of the NP-complete problem.
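As a concrete illustration of the reduction just described, the following sketch builds S and C from a graph exactly as the text states (one element per edge of G, with Ck collecting the edges incident to vk); the function name and the integer edge encoding are our own.

```python
# Build the set-packing instance (S, C) from a graph G, following the
# construction described in the text: S has one element per edge of G,
# and subset C_k collects the edges that contain vertex v_k.
def reduce_clique_to_set_packing(n_vertices, edges):
    S = list(range(len(edges)))  # element m stands for the m-th edge
    C = [{m for m, (p, q) in enumerate(edges) if k in (p, q)}
         for k in range(n_vertices)]
    return S, C

# Example: a triangle on vertices 0, 1, 2.
S, C = reduce_clique_to_set_packing(3, [(0, 1), (1, 2), (0, 2)])
```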
4. Using Sticker for Solving the Clique Problem in the Adleman-Lipton Model
4.1. The DNA Algorithm for Solving the Clique Problem
Suppose that a graph G = (V, E), where V is {v1, v2, …, vn} and E is {(vp, vq) | (vp, vq) is an edge in G}. Also assume that |V| is n and |E| is at most (n·(n − 1))/2. Consider the complementary graph G1 = (V, E1) for G, where E1 is {(vc, vd) | (vc, vd) is not in E}, and note that |E1| is (n·(n − 1))/2 − |E|. A clique for a graph G = (V, E) is a complete sub-graph of G [29]. The clique problem is to find a maximum-size clique in G. Algorithm 2 is presented for solving the clique problem; the notations in Algorithm 2 are similar to those described in Section 3.2.

Algorithm 2: Solving the clique problem.
(1) Input(T0), where tube T0 encodes all possible 2ⁿ choices of clique for a graph G = (V, E), where |V| is n and |E| is at most (n·(n − 1))/2.
(2) For m = 1 to (n·(n − 1))/2 − |E|
      Assume that em is an edge in G1 and em = (vc, vd). Also suppose that zc and zd, respectively, represent vc and vd.
      (a) TON = +(T0, zc¹) and TOFF = −(T0, zc¹).
      (b) TON1 = +(TON, zd¹) and TOFF1 = −(TON, zd¹).
      (c) Discard(TON1).
      (d) T0 = ∪(TOFF, TOFF1).
    EndFor
(3) For k = 0 to n − 1
      For j = k down to 0
        (3a) Tj+1ON = +(Tj, zk+1¹) and Tj = −(Tj, zk+1¹).
        (3b) Tj+1 = ∪(Tj+1, Tj+1ON).
      EndFor
    EndFor
(4) For k = n down to 1
      (a) If (detect(Tk) = 'yes') then
      (b)   Read(Tk) and terminate the algorithm.
          EndIf
    EndFor

Theorem 4-1: From the steps in Algorithm 2, the clique problem for a graph G = (V, E) can be solved, where |V| is n and |E| is at most (n·(n − 1))/2.
Proof:
A test tube of DNA strands that represent all possible 2ⁿ input bit sequences z1, …, zn is yielded in Step 1. It is obvious that the test tube contains all possible 2ⁿ choices of clique.

From the definition of clique [10], Step 2 is executed ((n·(n − 1))/2 − |E|) times to check which strands are legal. The first execution of Step 2(a) uses the "extraction" operation to form two test tubes: TON and TOFF. The first tube TON includes all of the strands that have zc = 1; that is to say, the cth vertex in V occurs in tube TON. The second tube TOFF consists of all of the strands that have zc = 0; that is to say, the cth vertex in V does not appear in tube TOFF. Then, the first execution of Step 2(b) uses the "extraction" operation to form two test tubes: TON1 and TOFF1. The first tube TON1 includes all of the strands that have zc = 1 and zd = 1; that is to say, tube TON1 includes vertices that are not connected in G. The second tube TOFF1 contains all of the strands that have zc = 1 and zd = 0; this is to say that the cth vertex in V occurs in TOFF1 and the dth vertex in V does not appear in TOFF1. From the definition of clique [10], tube TON1 includes illegal strands. Hence, Step 2(c) applies the "discard" operation to discard tube TON1. Next, Step 2(d) uses the "merge" operation to pour the two tubes TOFF and TOFF1 into tube T0, so that tube T0 contains only legal solutions. After the remaining iterations are processed, tube T0 consists of those strands representing legal cliques.
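Step 2 can be mimicked in software with bit tuples standing in for strands. This is a sketch of the filtering logic only; list operations replace the tube operations, and no chemistry is modelled.

```python
from itertools import product

# Simulate Step 2 of Algorithm 2: start from all 2^n bit strings
# (z_k = 1 means vertex k is in the candidate clique) and, for every
# edge (c, d) of the complementary graph G1, discard the strands with
# z_c = 1 and z_d = 1, since non-adjacent vertices cannot share a clique.
def filter_illegal_strands(n, complement_edges):
    tube = list(product((0, 1), repeat=n))
    for (c, d) in complement_edges:
        # Keep TOFF (z_c = 0) and TOFF1 (z_c = 1, z_d = 0); drop TON1.
        tube = [b for b in tube if not (b[c] == 1 and b[d] == 1)]
    return tube

# Example: 3 vertices where only the edge (0, 2) is missing from G.
legal = filter_illegal_strands(3, [(0, 2)])
```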
Each time the outer loop in Step 3 is executed, the inner loop is executed (k + 1) times. On the first execution of the outer loop, the inner loop is executed only once, so Steps 3(a) and 3(b) are executed once. Step 3(a) uses the "extraction" operation to form two test tubes: T1ON and T0. The first tube T1ON contains all of the strands that have z1 = 1. The second tube T0 consists of all of the strands that have z1 = 0. That is to say, the first tube encodes every clique with the first vertex in G and the second tube represents every clique without the first vertex in G. Then, Step 3(b) applies the "merge" operation to pour tube T1ON into tube T1. After repeated execution of Steps 3(a) and 3(b), n new tubes are finally produced. The tube Tk for n ≥ k ≥ 1 encodes those DNA strands that contain k vertices.
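The tube-sorting of Step 3 and the top-down scan of Step 4 can likewise be sketched with ordinary lists. Again this is a software analogy of ours, reproducing the net effect stated above (Tk holds the strands with k vertices), not the laboratory procedure.

```python
# Simulate Steps 3 and 4 of Algorithm 2: sort the legal strands into
# tubes T_0 ... T_n by the number of 1-bits (vertices) they contain,
# then detect from T_n downward and read the first non-empty tube.
def sort_and_read(legal_strands, n):
    tubes = {k: [] for k in range(n + 1)}
    for strand in legal_strands:
        tubes[sum(strand)].append(strand)  # T_k holds strands with k ones
    for k in range(n, 0, -1):              # Step 4: "detect" from T_n down
        if tubes[k]:
            return k, tubes[k][0]          # "read" one answer strand
    return 0, None

# Legal strands of the 3-vertex example with complement edge (0, 2):
strands = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
size, answer = sort_and_read(strands, 3)
```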
Because the clique problem is to find a maximum-size clique, tube Tn is detected first with the "detection" operation in Step 4(a). If it returns a "yes", then tube Tn contains at least one maximum-size clique. Therefore, Step 4(b) uses the "read" operation to describe the sequence of a molecule in tube Tn and terminate the algorithm. Otherwise, execution of Step 4(a) is repeated until a maximum-size clique is found in a detected tube.

4.2. The Complexity of Algorithm 2
Theorem 4-2: Suppose that a graph G = (V, E), where |V| is n and |E| is at most (n·(n − 1))/2. The clique problem for G can be solved with O(n² − |E|) biological operations in the Adleman-Lipton model.
Proof:
Algorithm 2 can be applied to solve the clique problem for G. Algorithm 2 includes three main steps. Step 2 is mainly used to remove illegal library strands from all of the 2ⁿ possible library strands. From Algorithm 2, Steps 2(a) through 2(d) take (2·((n·(n − 1))/2 − |E|)) "extraction" operations, ((n·(n − 1))/2 − |E|) "discard" operations and ((n·(n − 1))/2 − |E|) "merge" operations. Step 3 is mainly used to figure out the number of vertices in every legal clique. It is indicated from Algorithm 2 that Step 3(a) takes (n·(n+1)/2) "extraction" operations and Step 3(b) takes (n·(n+1)/2) "merge" operations. Step 4 is employed to find a maximum-size clique from the legal cliques. It is pointed out from Algorithm 2 that Step 4(a) takes at most n "detection" operations and Step 4(b) takes one "read" operation. Hence, from the statements mentioned above, it is inferred that the time complexity of
the worst case for Algorithm 2 is O(n² − |E|) biological operations in the Adleman-Lipton model.

Theorem 4-3: Suppose that a graph G = (V, E), where |V| is n and |E| is at most (n·(n − 1))/2. The clique problem for G can be solved with O(2ⁿ) library strands in the Adleman-Lipton model.
Proof: Refer to Theorem 4-2.

Theorem 4-4: Suppose that a graph G = (V, E), where |V| is n and |E| is at most (n·(n − 1))/2. The clique problem for G can be solved with O(n) tubes in the Adleman-Lipton model.
Proof: Refer to Theorem 4-2.

Theorem 4-5: Suppose that a graph G = (V, E), where |V| is n and |E| is at most (n·(n − 1))/2. The clique problem for G can be solved with the longest library strand, O(15·n), in the Adleman-Lipton model.
Proof: Refer to Theorem 4-2.

4.3. The Comparison of Algorithm 2 with Algorithm 1 for Dealing with the Clique Problem
From Theorem 3-6, the clique problem for a graph G = (V, E) can be directly reduced to the set-packing problem for a finite set S and a collection C. Algorithm 1 can then be applied to the clique problem for G with 2ⁿ DNA strands, O(|E|·n + n²) biological operations, O(n) tubes and the longest library strand, O(15·n + 15·|E|). From the characteristics of the clique problem, Algorithm 2 is proposed to deal with the same problem with 2ⁿ library strands, O(n² − |E|) biological operations, O(n) tubes and the longest library strand, O(15·n). From the statements above, for solving the clique problem for a graph G = (V, E), Algorithm 2 is better than Algorithm 1 in terms of biological operations and the longest library strand. This implies that each NP-complete problem should correspond to a new DNA algorithm.

5. Experimental Results of Simulated DNA Computing
We modified the Adleman program [22] on a Pentium 4 with 128 MB of main memory; the operating system was Windows 98 and the compiler was Visual C++ 6.0. The modified program was applied to generating DNA sequences for solving the set-packing problem and the clique problem. Because the source code of the two functions srand48() and drand48() was not found in the original Adleman program, we used the standard function srand() in Visual C++ 6.0 to replace srand48() and added source code for the function drand48(). We also added subroutines to the Adleman program for simulating the biological operations of the Adleman-Lipton model in Section 2, as well as subroutines to simulate Algorithm 1 and Algorithm 2. For convenience of presentation, the simulation of Algorithm 1 is described below in detail; the simulation of Algorithm 2 is similar.
The Adleman program was used to construct a 15-base DNA sequence for every bit of the library. For each bit, the program generates two 15-base random sequences (for '1' and '0') and checks whether the library strands satisfy the seven constraints in subsection 3.2 with the new DNA sequences added. If the constraints are satisfied, the new DNA sequences are 'greedily' accepted. If the constraints are not satisfied, then mutations are introduced one by one into the new block until either: (A) the constraints are satisfied and the new DNA sequences are accepted, or (B) a threshold for the number of mutations is exceeded, in which case the program has failed and exits, printing the sequences found so far. If (n + q) bits that satisfy the constraints are found, then the program has succeeded and outputs these sequences.
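The generate-then-mutate loop described above can be sketched as follows. For brevity only one stand-in constraint is checked (the seventh constraint's requirement of 4, 5, or 6 G's per 15-mer), not all seven, and the function names and mutation threshold are ours.

```python
import random

BASES = "ACGT"

def g_count_ok(seq):
    # Stand-in for the seventh constraint of subsection 3.2:
    # a probe sequence must contain 4, 5, or 6 G's.
    return 4 <= seq.count("G") <= 6

def generate_sequence(length=15, max_mutations=100, seed=0):
    # Generate a random sequence, then introduce point mutations one by
    # one until the constraint holds or the mutation budget is exceeded.
    rng = random.Random(seed)
    seq = [rng.choice(BASES) for _ in range(length)]
    for _ in range(max_mutations):
        if g_count_ok("".join(seq)):
            return "".join(seq)
        seq[rng.randrange(length)] = rng.choice(BASES)
    return None  # threshold exceeded: the program reports failure

probe = generate_sequence()
```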
Consider the finite set S and the collection C in Figure 1. The collection C includes two subsets: {1} and {2}. The finite set S is {1, 2}. DNA sequences generated by the modified Adleman program are shown in Table 2. The program took one mutation to make the new DNA sequences for the two subsets in C and the two elements of S. With the nearest-neighbor parameters, the Adleman program was used to calculate the enthalpy, entropy, and free energy for the binding of each probe to its corresponding region on a library strand. The energies used are shown in Table 3. Only ΔG really matters to the energy of each bit. For example, the ΔG for the probe binding a '1' in the first bit is 25 kcal/mol and the ΔG for the probe binding a '0' is estimated to be 24.3 kcal/mol.
Bit    5'→3' DNA Sequence
z1⁰    AATTCACAAACAATT
z2⁰    ACTCCTTCCCTACTC
z1¹    TCTCCCTATTTATTT
z2¹    TCACCAAACCTAAAA
y1⁰    TCTCTCTCTAATCAT
y2⁰    ACTCACATACACCAC
y1¹    CCATCATCTACCTTA
y2¹    CAACCTATTATCTTA

Table 2: Sequences chosen to represent the two subsets in C and every element of S in Figure 1.
Bit    Enthalpy energy (H)    Entropy energy (S)    Free energy (G)
z1¹    114.4                  299.4                 25
z1⁰    107.8                  278.6                 24.3
z2¹    111.5                  284.8                 26.2
z2⁰    109.1                  279                   25.9
y1¹    105.2                  270.5                 24.4
y1⁰    97.3                   252.3                 22.1
y2¹    107                    282.4                 22.5
y2⁰    94.7                   239.8                 22.8

Table 3: The energy for binding of each probe to its corresponding region on a library strand.
The program simulated a mix-and-split combinatorial synthesis technique [24] to synthesize the library strand for every possible set-packing. These library strands are shown in Table 4 and represent every possible set-packing: ∅, {2}, {1} and {1, 2}. The program also figured out the average and standard deviation of the enthalpy, entropy and free energy over all probe/library strand interactions. The energy levels are shown in Table 5. The standard deviation for ΔG is small because this is partially enforced by the constraint that there are 4, 5, or 6 G's in the probe sequences (the seventh constraint in subsection 3.2).
∅        5'-AATTCACAAACAATTACTCCTTCCCTACTC-3'
         3'-TTAAGTGTTTGTTAATGAGGAAGGGATGAG-5'

{2}      5'-AATTCACAAACAATTTCACCAAACCTAAAA-3'
         3'-TTAAGTGTTTGTTAAAGTGGTTTGGATTTT-5'

{1}      5'-TCTCCCTATTTATTTACTCCTTCCCTACTC-3'
         3'-AGAGGGATAAATAAATGAGGAAGGGATGAG-5'

{1, 2}   5'-TCTCCCTATTTATTTTCACCAAACCTAAAA-3'
         3'-AGAGGGATAAATAAAAGTGGTTTGGATTTT-5'

Table 4: DNA sequences chosen to represent every possible set-packing in tube T0.
                     Enthalpy energy (H)    Entropy energy (S)    Free energy (G)
Average              105.875                273.35                24.15
Standard deviation   6.31031                17.7764               1.45002

Table 5: The energy over all probe/library strand interactions.
The Adleman program was employed to compute the distribution of the different types of potential mishybridizations. This distribution is the absolute frequency of a probe-strand match of length k, for k from 0 to the bit length 15, where the probes are not supposed to match the strands. The distribution was 120, 209, 265, 378, 418, 396, 279, 147, 78, 32, 16, 0, 0, 0, 0 and 0. The last five zeros indicate that there are 0 occurrences where a probe matches a strand at 11, 12, 13, 14 or 15 places, which shows that the third constraint in subsection 3.2 has been satisfied. The number of matches peaks at 4 (418); that is to say, there are 418 occurrences where a probe matches a strand at 4 places.

Bit string of solution   5'→3' DNA Sequence
00                       5'-AATTCACAAACAATTACTCCTTCCCTACTC-3'
101(y1¹)                 5'-TCTCCCTATTTATTTACTCCTTCCCTACTCCCATCATCTACCTTA-3'
011(y2¹)                 5'-AATTCACAAACAATTTCACCAAACCTAAAACAACCTATTATCTTA-3'
111(y1¹)1(y2¹)           5'-TCTCCCTATTTATTTTCACCAAACCTAAAACCATCATCTACCTTACAACCTATTATCTTA-3'

Table 6: DNA sequences generated by Step 2 were applied to represent legal solutions in tube T0.
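The mishybridization tally described above can be reproduced with a simple counting loop. We assume here that "a probe-strand match of length k" means the probe and an off-target region agree at k base positions; this reading, and the helper name, are ours.

```python
# Histogram the number of agreeing base positions between each probe and
# each region it is NOT supposed to bind, for match lengths 0..length.
def mishybridization_histogram(probes, regions, length=15):
    hist = [0] * (length + 1)
    for probe in probes:
        for region in regions:
            matches = sum(1 for a, b in zip(probe, region) if a == b)
            hist[matches] += 1
    return hist

# Tiny example with 5-base sequences instead of 15-base ones:
hist = mishybridization_histogram(["ACGTA"], ["ACGTT", "TTTTT"], length=5)
```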
Tube   Bit string of solution   5'→3' DNA Sequence
T0     00                       5'-AATTCACAAACAATTACTCCTTCCCTACTC-3'
T1     101(y1¹)                 5'-TCTCCCTATTTATTTACTCCTTCCCTACTCCCATCATCTACCTTA-3'
T1     011(y2¹)                 5'-AATTCACAAACAATTTCACCAAACCTAAAACAACCTATTATCTTA-3'
T2     111(y1¹)1(y2¹)           5'-TCTCCCTATTTATTTTCACCAAACCTAAAACCATCATCTACCTTACAACCTATTATCTTA-3'

Table 7: DNA sequences generated by Step 3 were employed to represent legal solutions in tubes T0, T1 and T2.
Bit string of solution   5'→3' DNA Sequence
111(y1¹)1(y2¹)           5'-TCTCCCTATTTATTTTCACCAAACCTAAAACCATCATCTACCTTACAACCTATTATCTTA-3'

Table 8: The answer was found from Step 4 in tube T2.
The results for simulation of Steps 2 through 4 are shown in Tables 6, 7 and 8.
From tube T2, the answer was found to be {1, 2}.
6. Conclusions and Future Work
The proposed algorithms (Algorithms 1 and 2) for solving the set-packing and clique problems are based on the biological operations of the Adleman-Lipton model and the solution space of stickers in the sticker-based model. The present algorithms have several advantages from the Adleman-Lipton model and the sticker-based model. First, the proposed algorithms have a lower rate of hybridization errors, because we modified the Adleman program to generate good DNA sequences for constructing the solution space of stickers for the set-packing and clique problems, and only simple and fast biological operations in the Adleman-Lipton model were employed to solve the problems. Second, the biological operations in the Adleman-Lipton model have been performed in a fully automated manner in the laboratory. Full automation is essential not only for the speedup of computation but also for error-free computation. Third, in Algorithm 1 for solving the set-packing problem, the number of tubes, the longest length of DNA library strands and the number of DNA library strands are, respectively, O(n), O(15·n + 15·q) and O(2ⁿ). In Algorithm 2 for solving the clique problem, they are, respectively, O(n), O(15·n) and O(2ⁿ). This implies that the present algorithms can be easily performed in a fully automated manner in a laboratory. Furthermore, the present algorithms generate 2ⁿ library strands satisfying the seven constraints in subsection 3.2, corresponding to the 2ⁿ possible solutions; this allows the present algorithms to be applied to larger instances of the set-packing and clique problems. Fourth, from the statements in subsection 4.3, for solving the clique problem, Algorithm 2 is better than Algorithm 1 in terms of biological operations and the longest library strand.
Therefore, this seems to imply that Cook's Theorem is unsuitable for DNA-based computers and that each NP-complete problem should correspond to a new DNA algorithm.

Currently, many NP-complete problems cannot be solved because it
is very difficult to support basic biological operations using mathematical operations. We are not sure whether molecular computing can be applied to dealing with every NP-complete problem. Therefore, in the future, our main work is to solve other NP-complete problems that were unresolved with the Adleman-Lipton model and the sticker model.
References
[1] R. R. Sinden. DNA Structure and Function. Academic Press, 1994.
[2] L. Adleman. Molecular computation of solutions to combinatorial problems. Science, 266:1021-1024, Nov. 11, 1994.
[3] R. J. Lipton. DNA solution of hard computational problems. Science, 268:542-545, 1995.
[4] Q. Ouyang, P. D. Kaplan, S. Liu, and A. Libchaber. DNA solution of the maximal clique problem. Science, 278:446-449, 1997.
[5] M. Arita, A. Suyama, and M. Hagiya. A heuristic approach for Hamiltonian path problem with molecules. Proceedings of 2nd Genetic Programming (GP-97), 1997, pp. 457-462.
[6] N. Morimoto, M. Arita, and A. Suyama. Solid phase DNA solution to the Hamiltonian path problem. DIMACS (Series in Discrete Mathematics and Theoretical Computer Science), Vol. 48, 1999, pp. 93-206.
[7] A. Narayanan, and S. Zorbala. DNA algorithms for computing shortest paths. In Genetic Programming 1998: Proceedings of the Third Annual Conference, J. R. Koza et al. (Eds), 1998, pp. 718--724.
[8] S.-Y. Shin, B.-T. Zhang, and S.-S. Jun. Solving traveling salesman problems using molecular programming. Proceedings of the 1999 Congress on Evolutionary Computation (CEC99), vol. 2, pp. 994-1000, 1999.
[9] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press.
[10] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, CA, 1979.
[11] D. Boneh, C. Dunworth, R. J. Lipton and J. Sgall. On the computational Power of DNA. In Discrete Applied Mathematics, Special Issue on Computational Molecular Biology, Vol. 71 (1996), pp. 79-94.
[12] L. M. Adleman. On constructing a molecular computer. DNA Based Computers, Eds. R. Lipton and E. Baum, DIMACS: series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society. 1-21 (1996)
[13] M. Amos. "DNA Computation". Ph.D. Thesis, Department of Computer Science, University of Warwick, 1997.
[14] S. Roweis, E. Winfree, R. Burgoyne, N. V. Chelyapov, M. F. Goodman, Paul W.K. Rothemund and L. M. Adleman. “A Sticker Based Model for DNA
Computation”. 2nd annual workshop on DNA Computing, Princeton University. Eds. L. Landweber and E. Baum, DIMACS: series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society. 1-29 (1999).
[15] M. J. Perez-Jimenez and F. Sancho-Caparrini. "Solving Knapsack Problems in a Sticker Based Model". 7th Annual Workshop on DNA Computing, DIMACS: Series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society, 2001.
[16] G. Paun, G. Rozenberg and A. Salomaa. DNA Computing: New Computing Paradigms. Springer-Verlag, New York, 1998. ISBN: 3-540-64196-3.
[17] W.-L. Chang and M. Guo. "Solving the Dominating-set Problem in Adleman-Lipton’s Model". The Third International Conference on Parallel and Distributed Computing, Applications and Technologies, Japan, 2002, pp. 167-172.
[18] W.-L. Chang and M. Guo. "Solving the Clique Problem and the Vertex Cover Problem in Adleman-Lipton's Model". IASTED International Conference, Networks, Parallel and Distributed Processing, and Applications, Japan, 2002, pp. 431-436.
[19] W.-L. Chang and M. Guo. "Solving NP-Complete Problem in the Adleman-Lipton Model". The Proceedings of the 2002 International Conference on Computer and Information Technology, Japan, 2002, pp. 157-162.
[20] W.-L. Chang and M. Guo. "Resolving the 3-Dimensional Matching Problem and the Set-Packing Problem in Adleman-Lipton's Model". IASTED International Conference, Networks, Parallel and Distributed Processing, and Applications, Japan, 2002, pp. 455-460.
[21] Bin Fu. "Volume Bounded Molecular Computation". Ph.D. Thesis, Department of Computer Science, Yale University, 1997.
[22] Ravinderjit S. Braich, Clifford Johnson, Paul W.K. Rothemund, Darryl Hwang, Nickolas Chelyapov and Leonard M. Adleman. "Solution of a satisfiability problem on a gel-based DNA computer". Proceedings of the 6th International Conference on DNA Computation in the Springer-Verlag Lecture Notes in Computer Science series.
[23] Kalim Mir. “A restricted genetic alphabet for DNA computing”. Eric B. Baum and Laura F. Landweber, editors. DNA Based Computers II: DIMACS Workshop, June 10-12, 1996, volume 44 of DIMACS: Series in Discrete Mathematics and Theoretical Computer Science, Providence, RI, 1998, pp. 243-246.
[24] A. R. Cukras, Dirk Faulhammer, Richard J. Lipton, and Laura F. Landweber. “Chess games: A model for RNA-based computation”. In Proceedings of the 4th DIMACS Meeting on DNA Based Computers, held at the University of Pennsylvania, June 16-19, 1998, pp. 27-37.
[25] W.-L. Chang and M. Guo. "Using Sticker for Solving the Dominating-set Problem in the Adleman-Lipton Model". IEICE Transactions on Information and Systems, 2003, accepted.
[26] J. H. Reif, T. H. LaBean and N. C. Seeman. "Challenges and Applications for Self-Assembled DNA Nanostructures". Proceedings of the 6th DIMACS Workshop on DNA Based Computers, Leiden, Holland, June 2000.
[27] T. H. LaBean, E. Winfree and J. H. Reif. "Experimental Progress in Computation by Self-Assembly of DNA Tilings". Theoretical Computer Science, Volume 54, pp. 123-140, 2000.
[28] Cook, S. A. “The Complexity of Theorem-proving Procedures”. Proc. 3rd Ann. ACM Sym. On Theory of Computing, ACM, New York, pp. 151-158, 1971.
[29] Karp, R. M. “Reducibility among combinatorial Problems”. Complexity of Computer Computation, Plenum Press, New York, pp. 85-103, 1972.
[30] C. Mao, T. H. LaBean, J. H. Reif and N. C. Seeman. "Logical computation using algorithmic self-assembly of DNA triple-crossover molecules". Nature, 407, pp. 493-496, 2000.
[31] M. H. Garzon and R. J. Deaton. "Biomolecular Computing and Programming". IEEE Transactions on Evolutionary Computation, Volume 3, pp. 236-250, 1999.
[32] W.-L. Chang, M. Guo and M. Ho. "Solving the Set-splitting Problem in the Sticker-based Model and the Adleman-Lipton Model". The 2003 International Symposium on Parallel and Distributed Processing and Applications, Aizu City, Japan, 2003, accepted.
[33] E. Bach, A.Condon, E. Glaser and C. Tanguay. "DNA models and algorithms for NP-complete problems". In Proceedings of the 11th Annual Conference on Structure in Complexity Theory, pp. 290-299, 1996.
HPM: A Hierarchical Model for Parallel Computations*
Xiangzhen Qiao & Shuqing Chen
The Institute of Computing Technology, Chinese Academy of Sciences Beijing, P R China 100080
Abstract: A hierarchical model for parallel computations is introduced and evaluated in this paper. The model describes general homogeneous parallel computer systems with Hierarchical Parallelism and hierarchical Memories (HPM). A parallel function HP describes the multi-level parallelism of the system. The memory function HM shows the characteristics of the hierarchical memories. The binding relation HB gives the connection between parallelism and the hierarchical memories. Some commercial parallel computers are illustrated according to the HPM model. The performance of parallel algorithms is analyzed from the parallelism and locality point of view.
Key Words: HPM, E-RAM, Binding relation, Vertical points, Crossing points.
1. Introduction
A fundamental challenge in parallel processing is to develop effective models for parallel computation. A number of models for parallel algorithm design and analysis have been proposed, refined and studied in the last 20 years [1]. Primary among them are PRAM [2], BSP [3] and LogP [4]. Several models are primarily concerned with the memory hierarchy, focusing on accounting for the various levels of the storage hierarchy [2].
The standard model, the Random Access Machine (RAM) [1], does not represent memory
hierarchies. It assumes unit-time access to all memory cells, regardless of large cost
differences between accesses to the cache, main memory, and disks. The differences are
unlikely to disappear in the future. For most algorithms, the RAM assumption is too
inaccurate, and some alternative models have been proposed, such as the uniform memory
hierarchy model (UMH) [5]. Memory-hierarchy models have been extended by parallelism,
and parallel models have been extended by a representation of memory hierarchy [5].
A hierarchical model for parallel computations is introduced in this paper. This model
describes the general homogeneous parallel computer systems with Hierarchical Parallelism
and hierarchical Memories (HPM). A parallel function HP is defined to describe the
* Supported by the National Natural Science Foundation of China (Grant 69933020).
multi-level parallelism of the system. The function HM is used to show the characteristics of
hierarchical memories. The topology of the system is not addressed, but a binding relation HB
is defined to describe the connection of parallelism and hierarchical memories. Four
parameters are used to analyze the performance of the HPM model: p (number of processors);
s (peak performance of a processor in the HPM); k and k+
(decreasing ratio of the
performance of the memory hierarchy to the cpu). Some commercial parallel computers are
illustrated according to the HPM model. The parallel HPM algorithm is defined in this paper.
A parallel HPM algorithm consists of a set of p independent sub-algorithms, each running on
a processor of an HPM system. These sub-algorithms interact through some interactive
points. The sub-algorithms are different from the general RAM algorithm, and we call them
Enhanced RAM (E-RAM) algorithms: a traditional RAM algorithm with the
consideration of the parallelism and memory hierarchies inside a cpu. The performance of
parallel algorithms is analyzed from the parallelism and locality point of view. The
performance of an HPM algorithm depends on the E-RAM algorithm, the parallelism behavior
of the algorithm, and the influence of the computing (parallel) interactive points and data
interactive points. The real performance of the algorithm will depend on both sides of the
architecture and algorithm characteristics.
Compared with other memory hierarchy models, the most significant characteristic of
the HPM model is that it is computing-centric (parallelism-centric). Parallelism is the
everlasting factor influencing the performance of parallel algorithms. The second is that the
connection of parallelism and the locality is given in a relation HB. The influence of memory
hierarchies on the computation can easily be analyzed from HB. Third, the “synchronism” is
not defined in this model. In fact, it is an asynchronous model. The synchronism is hidden in
the data communication between the processors. Fourth, the parallel systems are described
from the “parallelism” and the “locality”. If only the “parallelism” is concerned, the HPM
model is equivalent to a PRAM model. Most of the present parallel algorithms are designed in
the PRAM model. Thus many HPM algorithms can be transferred from PRAM algorithms. In
this paper, we mainly discuss the model and its performance analysis based on scientific
computing.
The remainder of the paper is organized as follows. The HPM model is defined in Section 2, where some examples are given. Parallel algorithms based on the HPM model are defined in Section 3. The performance of parallel algorithms is analyzed in Section 4. The paper ends with the conclusion.
2. The HPM model
The HPM model, denoted as HPM(hp, hm, hB), describes general homogeneous parallel computer systems with Hierarchical Parallelism and hierarchical Memories. It defines a parallel computer system in three respects.
2.1 Hierarchical parallelism
A parallel function HP describes the multi-level parallelism of an HPM system:

    HP = (p0, p1, …, php)                                (1)

Here hp is the number of levels of parallelism in the HPM system. p0 is the parallelism inside a cpu, including the characteristics of pipelining, multiple function units, chaining, and so on. In fact, p0 is equal to the number of floating point operations that can be performed by a processor in one clock cycle. On the i-th (1 ≤ i ≤ hp) level of parallelism there are multiple homogeneous sub-systems. The "macro parallelism" of a sub-system on the i-th level of the HPM system is given by pi (1 ≤ i ≤ hp). In general, on the i-th (2 ≤ i < hp) level, there are ps1 (ps1 = php · php−1 · … · pi) sub-systems, each with parallelism ps2 (ps2 = p1 · p2 · … · pi−1). On the hp-th level of parallelism, the parallel system has php sub-systems, each with parallelism ps2 (ps2 = p1 · … · php−1). When i = 1, ps1 = p (p = p1 · p2 · … · php) and ps2 = 1. For simplicity and without loss of generality, here we only discuss homogeneous systems.

If the peak performance of a processor in the HPM(hp, hm, hB) system is s, then the peak performance Shpm of the above HPM will be Shpm = s · p1 · p2 · … · php = s · p.
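The bookkeeping of this subsection can be stated directly in code. The sketch below (illustrative names and a hypothetical 3-level configuration) computes ps1 and ps2 for a level i, and the peak performance Shpm = s · p.

```python
from math import prod

# p = [p1, ..., p_hp] lists the parallelism of each level; i counts from 1.
def hpm_level(p, i):
    ps1 = prod(p[i - 1:])  # number of sub-systems on the i-th level
    ps2 = prod(p[:i - 1])  # parallelism of each such sub-system
    return ps1, ps2

def hpm_peak(s, p):
    # S_hpm = s * p1 * p2 * ... * p_hp = s * p
    return s * prod(p)

# Hypothetical 3-level system: 4-way cpus, 8 cpus per node, 16 nodes.
ps1, ps2 = hpm_level([4, 8, 16], 2)
```

Note that for i = 1 the formulas reproduce the text's special case ps1 = p and ps2 = 1.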
2.2 Hierarchical memories and binding relation
The memory system of the HPM model consists of multi-level hierarchical memories. The sequence of memory modules MUk (1 ≤ k ≤ hm) is a memory hierarchy; there are in total hm levels of hierarchical memories. Module MUk can hold multiple blocks, and a "bus" BUSk connects MUk to the next module MUk+1. Blocks are further partitioned into sub-blocks. The sub-block is the unit of transfer along the bus from the previous module; we call the sub-block a basic block in this paper. All buses can be active concurrently. The only
4
restriction is that no data item in the block being transferred on BUSi is available to buses
BUSk-1 or BUSk+1 until the transfer of the entire block is completed. Here MU1 can be the
caches, MU2 main memory, MU3 hard disk …. Three important characteristics are required for
the messages stored in MUk: coherency, inclusion, and locality [7]. Generally speaking, the
transformation of the messages between the adjacent levels of the memory hierarchy is
consecutive. The nearer the cpu, the higher transfer speed, and smaller capacity of the
memory module. Thus, the hit rate and the reuse possibility are important for the memory
system [7].
A binding relation HB describes the connection between parallelism and locality:
HB = hl1, hl2, …, hlhp        (2)
Here hli (1 ≤ i ≤ hp) describes the data-transfer style between the pi sub-systems on the i-th parallel level. In general, hli can take two kinds of values: a positive integer k (1 ≤ k ≤ hm) means that information exchange between the pi sub-systems is through data accesses to a common shared memory (based on MUk). hli can also take the value k+, which means that data transfer between the pi sub-systems is through message passing in a network connecting them. The network is based on the k-th level of the memory hierarchy, MUk, of each of the pi sub-systems.
2.3 The performance of the memory hierarchy
In this paper we take the general (L, B) analysis for memory and communication timing. Thus, the performance of the memory hierarchy of the HPM model can be expressed by:
fj = (Uj, lj, Bj, Lj, gj(qj))        (3)
or
fj+ = (lj+, Bj+, Lj+, gj+(qj))        (4)
In equation (3), 1 ≤ j ≤ hm. Uj is the capacity of MUj, lj is the length of the basic block for hierarchy MUj, Bj is the bandwidth of data transfer between MUj and MUj+1, and Lj is the latency (including the hardware and software latency). gj(qj) is the penalty of memory-access conflicts; if the memory is private to a processor, gj(qj) = 0, otherwise gj(qj) ≥ 0.
In equation (4), hm < j ≤ hm + hc, where hc is the number of hlj taking a value j+ (1 ≤ j ≤ hm). lj+ is the length of the basic block between the MUj of different sub-systems whose data communication is based on memory hierarchy j, Bj+ is the bandwidth of the data transfer, Lj+ is the latency, and gj+(qj) is the penalty of communication conflicts.
In general, the time for transferring data of length nl between memory hierarchies MUj and MUj+1 can be expressed by the following equation:
tcj = (⌈nl / lj⌉ * lj) / Bj + Lj        (5)
Here 1 ≤ j ≤ hm. For the same data, the time for message passing on the j-th level (hm < j ≤ hm + hc) will be:
tcj+ = (⌈nl / lj+⌉ * lj+) / Bj+ + Lj+        (6)
Generally, we have Uj-1 < Uj, lj-1 < lj, Bj-1 > Bj, and Lj-1 < Lj for 1 < j ≤ hm, as well as Bj > Bj+ and Lj < Lj+.
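Equations (5) and (6) are easy to mechanize. The sketch below (parameter values are invented for illustration) computes the transfer time, rounding the transfer up to whole basic blocks as the ⌈nl/lj⌉ term requires:

```python
import math

def transfer_time(nl, l_j, B_j, L_j):
    """Time to move nl words between hierarchy levels MU_j and MU_j+1
    (equation (5)); the same formula with l_j+, B_j+, L_j+ gives the
    message-passing time of equation (6).

    nl  -- number of words to transfer
    l_j -- basic-block length (words): transfers happen in whole blocks
    B_j -- bandwidth (words per second)
    L_j -- latency (seconds, hardware + software)
    """
    blocks = math.ceil(nl / l_j)          # whole basic blocks moved
    return (blocks * l_j) / B_j + L_j

# 1000 words in 16-word blocks at 1e8 words/s with 5 us latency:
# 63 blocks * 16 words = 1008 words are actually moved on the bus.
t = transfer_time(1000, 16, 1e8, 5e-6)
print(t)   # 1008/1e8 + 5e-6 = 1.508e-05 seconds
```

The rounding term is why short transfers are latency-dominated: halving nl below one block length does not reduce the time at all.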
2.4 Examples of HPM model
The HPM model can be used to describe general commercial computer systems. For
example:
A common sequential computer can be described as HPM(0, 3, hB), HP = p0, p0 ≥ 1. This system consists of a CPU and a memory system. The memory hierarchies can be caches, main memory, and hard disk. When the computer has parallelism inside the CPU, that is, multiple floating-point operations can be completed in one clock cycle, p0 > 1. For example, for a PowerPC 604e computer, we have p0 = 2. HB is not defined for a sequential computer.
A symmetric shared-memory multiprocessor (SMP) system can be defined in the HPM model as HPM(1, 3, hB), HP = p0, p1. Here p0 has the same meaning as above, and p1 is the number of CPUs in the SMP. The memory hierarchies of an SMP are MUi (1 ≤ i ≤ 3), corresponding to cache, main memory, and hard disk respectively. HB = 2, which means that information transfer between the processors is through the second level of the memory hierarchy, the common shared memory. For an IBM Power II-270 SMP multiprocessor [8], p0 = 2 and p1 = 2. For a Power3-II SMP multiprocessor [8], p0 = 4 and p1 = 4.
Distributed-memory parallel computers and clusters of computers can be described in the HPM model as HPM(1, 3, hB), HP = p0, p1. Here p0 has the same meaning as above, and the number of processors in the system is given by p1. The binding relation is HB = 2+. This means that information transfer between the processors is completed through the network, and the network is based on the 2nd memory hierarchy, the private main memory of every processor. For example, for the Dawning D1000A [8], p0 = 2 and p1 = 8; for the Dawning 2000-1, p0 = 2 and p1 = 32.
Clusters of SMPs (abbreviated as CoSMPs in this paper) using fast networks, such as Myricom's Myrinet [9], have emerged as important platforms for high performance computing [10]. A CoSMP has two levels of parallelism. The hierarchical memories consist of caches, main memory, and hard disk. Thus, its HPM model can be denoted as HPM(2, 3, hB), and HP = p0, p1, p2, where p0 is the parallelism inside a CPU, p1 is the parallelism of an SMP sub-system, and p2 is the parallelism on the cluster level, that is, the number of SMP sub-systems in the whole system. Here we call an SMP sub-system a "node". The total parallelism of the system is p = p1 * p2. The binding relation of a CoSMP is expressed as HB = 2, 2+. Information transfer between the sub-systems of the first level of parallelism (inside an SMP) is through the second level of the memory hierarchy: each processor in an SMP exchanges information with other processors by writing (or reading) the information to (or from) the common shared memory. Information transfer between the sub-systems of the second level of the system (on the cluster level, between the SMP nodes) is completed through the network, and the network is based on the 2nd memory hierarchy, the private main memory in every node. For example, the Dawning D3000 [8] is a CoSMP system. It consists of multiple workstation nodes, each with four IBM Power3-II microprocessors. Every processor has two function units, and a floating-point multiply-add operation can be completed in one clock cycle by each function unit. The interconnection network, Myrinet, is used for high-speed communication. Thus, we have p0 = 4, p1 = 4, and p2 = nsizes, where nsizes is the number of nodes in the system. HP = 4, 4, nsizes, and the total parallelism of the system is p = 4 * nsizes.
2.5 Performance parameters of the HPM model
According to the above analyses and paper [12], we can use the decreasing ratios γk and γk+ to describe the performance decrease across the memory hierarchy in an HPM model. If the peak performance of a processor in an HPM is denoted s, the memory access speed of memory hierarchy k is denoted Vk, and the performance of memory hierarchy k+ is denoted Vk+, then we have the following definition.
Definition 1 The decreasing ratio γk is defined as the ratio of the performance of memory hierarchy k to the peak performance of a central processor of the HPM: γk = Vk / s. The decreasing ratio γk+ is defined as the ratio of the performance of memory hierarchy k+ to the peak performance of the central processor of the HPM: γk+ = Vk+ / s.
Obviously, 0 < γk ≤ 1 and 0 < γk+ ≤ 1, with γk+ ≤ γk, γk+1 ≤ γk, and γ(k+1)+ ≤ γk+.
The performance parameters of the HPM model can be summarized as follows:
p: number of processors;
s: peak performance of a processor in the HPM;
γk: decreasing ratio of the performance of memory hierarchy k relative to the CPU;
γk+: decreasing ratio of the performance of memory hierarchy k+ relative to the CPU.
The real performance of an HPM algorithm can be analyzed by the parameters (p, s, γk, γk+) and the characteristics of the HPM algorithm.
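A small sketch of the decreasing ratios (the access-speed values below are invented for illustration, not measurements from the paper), including a check of the ordering constraints 0 < γk ≤ 1 and γk+1 ≤ γk stated above:

```python
def decreasing_ratios(s, level_speeds):
    """gamma_k = V_k / s for each memory hierarchy level k.

    s            -- peak performance of a processor
    level_speeds -- access speeds V_1 >= V_2 >= ... (cache, memory, disk...)
    """
    gammas = [v / s for v in level_speeds]
    assert all(0 < g <= 1 for g in gammas), "each ratio must lie in (0, 1]"
    assert all(a >= b for a, b in zip(gammas, gammas[1:])), \
        "ratios must not increase moving away from the CPU"
    return gammas

# Hypothetical machine: cache keeps pace with the CPU, main memory runs
# at 10% of peak, disk at 0.1% of peak.
print(decreasing_ratios(1e9, [1e9, 1e8, 1e6]))   # [1.0, 0.1, 0.001]
```
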
3. HPM algorithms
A parallel algorithm on an HPM(hp, hm, hB) model consists of p independent sub-algorithms, each running on a processor of the HPM system and "communicating" by some rules. Each sub-algorithm is a RAM algorithm with extra considerations; we call such a sub-algorithm an E-RAM algorithm.
3.1 E-RAM algorithms
Enhanced RAM (E-RAM) algorithm: a traditional RAM algorithm extended with consideration of memory hierarchies and parallelism inside a CPU.
E-RAM algorithms differ from traditional RAM algorithms. For example, on the RAM model, the three computing methods for matrix multiplication (inner, middle, and outer product) are computationally equivalent, but on the E-RAM model the three methods have totally different computational behavior [6], because they have totally different memory behavior.
3.2 Parallel HPM algorithms
A parallel algorithm on an HPM model consists of p independent sub-algorithms, each running on a processor of the HPM system and "communicating" by some rules. Before we discuss HPM algorithms, we need some related definitions.
Computing (parallel) interactive point
A computing (parallel) interactive point is a starting, ending, or synchronization point in a parallel computation.
Data interactive point
A data interactive point is a point of data access or data communication in a computation.
Data interactive points can be further distinguished as vertical interactive points or crossing interactive points.
Vertical interactive point
A vertical interactive point is a point of data access to a memory hierarchy with level k. Examples are general memory accesses, such as cache, main memory, and disk accesses.
Crossing interactive point
A crossing interactive point is a point of data access to a memory hierarchy with level k+. One example is a communication point through the network (memory hierarchy 2+) on a cluster or CoSMP system. We sometimes call a crossing interactive point a communication interactive point, and a vertical interactive point a memory interactive point.
We call all the above points interactive points. Thus, a parallel HPM algorithm can be defined as follows.
Parallel HPM algorithm
A parallel HPM algorithm consists of a set of p independent E-RAM sub-algorithms, each running on a processor of an HPM computer. These sub-algorithms interact through interactive points (including computing (parallel) interactive points and data interactive points).
3.3 Examples of HPM algorithm
A sequential algorithm is an E-RAM algorithm.
An SMP algorithm is a set of p independent E-RAM sub-algorithms, each running on a processor of an HPM computer of the SMP type. These sub-algorithms interact through computing (parallel) interactive points (such as the "FORK" and "JOIN" primitives on an SMP system) and vertical interactive points (for memory accesses).
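The FORK/JOIN structure of such an SMP algorithm can be sketched with ordinary threads (an illustrative stand-in, not the paper's own primitives): each thread runs its E-RAM sub-algorithm on a slice of the data, and a shared result array plays the role of the common shared memory through which the sub-algorithms interact.

```python
import threading

def smp_sum(data, p):
    """Sum `data` with p threads in FORK/JOIN style.

    Each thread is an E-RAM sub-algorithm working on its own slice;
    `partial` is the shared memory touched at the vertical interactive
    points (each thread writes only its own cell, so no lock is needed).
    """
    partial = [0] * p                       # shared memory between threads
    chunk = (len(data) + p - 1) // p

    def worker(i):
        partial[i] = sum(data[i * chunk:(i + 1) * chunk])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:                       # FORK: computing interactive point
        t.start()
    for t in threads:                       # JOIN: computing interactive point
        t.join()
    return sum(partial)                     # combine via shared memory

print(smp_sum(list(range(100)), p=4))       # 4950
```
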
A cluster algorithm is a set of p independent E-RAM sub-algorithms, each running on a processor of an HPM computer of the cluster type. These sub-algorithms interact through computing (parallel) interactive points (such as the "mpi_init" and "mpi_finalize" primitives on a cluster) and crossing interactive points.
A CoSMP algorithm is a set of p independent E-RAM sub-algorithms, each running on a processor of an HPM computer of the CoSMP type. These sub-algorithms interact through computing (parallel) interactive points (both cluster and SMP primitives), vertical interactive points (for memory accesses), and crossing interactive points (for communication).
4. Performance analyses of HPM algorithms
For an HPM algorithm, programming the computing (parallel) interactive points and the crossing interactive points requires special parallel primitives. Even though the programming syntax for the E-RAM algorithm and the vertical interactive points is the same as for a traditional sequential algorithm, the performance may differ. The performance of an HPM algorithm depends on the E-RAM algorithm, the parallelism behavior of the algorithm, and the influence of the computing (parallel) interactive points and data interactive points. The real performance of the algorithm depends on both the architecture and the algorithm characteristics.
4.1 The performance of E-RAM algorithms
According to the above definition, a sequential algorithm on an HPM system is an E-RAM algorithm. It is obvious that:
Lemma 1 If the time of a RAM algorithm for solving problem A on a computer with p0 = 1 is TRAM, then the time TE-RAM of the E-RAM algorithm for solving the same problem A on a computer with parallelism p0 satisfies TRAM / p0 < TE-RAM < TRAM.
If only the computing complexity is considered, the ideal execution time of an E-RAM algorithm for algorithm A (with computing complexity W) on a processor of the HPM with sequential peak performance s will be T1(t) = W / s.
Taking into account the ability of algorithm A to utilize the parallelism inside a CPU (represented by p0) and the influence of the data interactive points, the real execution time can be expressed by the following equation:
T1(r) = T1(t) + T1(P0) + T1(m) = T1(t) * (1 + η0(P0) + η0(m)),
where η0(P0) = T1(P0) / T1(t) and η0(m) = T1(m) / T1(t). Here η0(P0) ≥ 0, η0(m) > 0, and T1(r) > T1(t). η0(P0) is the influence factor of the parallelism inside a single CPU and of the algorithm's ability to utilize that parallelism; if the parallelism is well utilized, η0(P0) approaches 0. η0(m) is the influence factor of memory accesses and of the algorithm's memory behavior; a smaller value of η0(m) means higher memory-access efficiency.
Thus, the real performance of a sequential HPM algorithm can be expressed as S(r) = W / T1(r) = s / (1 + η0(P0) + η0(m)) = ε0 * s, where ε0 = T1(t) / T1(r) = 1 / (1 + η0(P0) + η0(m)); ε0 depends both on the optimization characteristics of the algorithm (memory accesses) and on the architecture.
Here we discuss the influence of the memory interactive points. For an algorithm A with computing complexity W, we use the following definitions:
Space complexity [11] M0: the necessary memory space for solving algorithm A.
Memory access complexity M1: the amount of real memory accesses for solving algorithm A on a real computer.
β1: the ratio of the computing complexity to the space complexity of A: β1 = W / M0.
α: the ratio of the space complexity to the memory access complexity of A: α = M0 / M1.
β2: the ratio of the computing complexity to the memory access complexity of A: β2 = W / M1.
δ: the ratio of the cache capacity Cc to the space complexity of A: δ = Cc / M0. This value is related to the cache hit rate; a large value of δ means the possibility of a higher cache reuse rate.
If the size of an algorithm is n, then the values of W, M0 and M1 are all functions of n, and so are β1 and β2. W and M0 (and hence β1) are intrinsic attributes of an algorithm, independent of its implementation on a particular computer. However, M1 depends not only on the characteristics of the algorithm, but also on the implementation style of the algorithm on a particular computer.
In general, we have α ≤ 1. If α equals 1, the algorithm cannot be improved from the memory-access point of view; however, this does not mean that the algorithm has high memory-access efficiency. In general, β1 = W / M0 = O(n^i), i > 0. In most cases, the bigger i is, the bigger the reuse rate of algorithm A and the higher the memory-access efficiency. Usually, β2 is used for comparing the memory-access efficiency of different implementations of an algorithm. When α is near 1 and β1 is a higher-order function of n, the algorithm can have high memory-access efficiency.
For the multiplication of two n × n matrices, W = 2n³ and M0 = 3n². The "inner product" (denoted (i,j,k)), "outer product" (denoted (k,i,j)), "middle product" (denoted (j,k,i)) and "blocking" methods are different implementations. If these methods are programmed in FORTRAN on a sequential computer, the values of M1, α and β2 are as listed in Table 1.

Table 1. The performance of matrix computing
Algorithm   M1                   α                     β2
(i,j,k)     nl·n² + (nl+1)·n³    3 / (nl + (nl+1)·n)   2n / (n + (n+1)·nl)
(k,i,j)     nl·n² + 2·nl·n³      3 / (nl + 2·nl·n)     2n / ((2n+1)·nl)
(j,k,i)     nl·n² + 2·n³         3 / (nl + 2·n)        2n / (2n + nl)
blocking    3·n0·n²              1 / n0                2n / (3·n0)

Here nl ≥ 1 is the length of the cache line (in words), and n = n0 * nb. In the blocking matrix multiplication, the matrices A, B and C are divided into sub-matrices, each of size nb × nb, so there are n0 × n0 sub-matrices in each matrix. When a block can be kept entirely in the caches, the values improve further.
For matrix multiplication, β1 = W / M0 = O(n), so it should have good memory-access efficiency. However, the values of α for the different implementation methods (inner product, outer product, middle product and blocking) differ. The value of α for the blocking method is the nearest to 1, which means that the blocking method has the best memory-access performance among the four implementations. This was tested on one processor of a Power3-II; the practical results agree with our principles here. The testing results are given in paper [6].
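The implementations compared in Table 1 differ only in loop order and blocking, yet touch memory very differently. A minimal sketch of the (i,j,k) and blocked variants (plain Python lists, for illustration only; a real comparison would use FORTRAN or C and time the variants):

```python
def matmul_ijk(A, B, n):
    """Inner-product (i,j,k) order: C[i][j] accumulates a dot product."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_blocked(A, B, n, nb):
    """Blocked order: each nb x nb sub-matrix is reused while it is hot
    in cache, which is why its alpha ratio in Table 1 is closest to 1."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, nb):
        for j0 in range(0, n, nb):
            for k0 in range(0, n, nb):
                for i in range(i0, min(i0 + nb, n)):
                    for j in range(j0, min(j0 + nb, n)):
                        for k in range(k0, min(k0 + nb, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

n = 4
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
B = [[float(j * n + i) for j in range(n)] for i in range(n)]
# Both orders compute the same product; only the memory traffic differs.
print(matmul_ijk(A, B, n) == matmul_blocked(A, B, n, nb=2))   # True
```
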
4.2 Some assumptions
For simplicity and without loss of generality, we make the following assumptions for parallel HPM algorithms.
If the parallel granularity is rather large, the time of the computing interactive points can be neglected compared with the computing time of the algorithm. Thus, an HPM algorithm can be seen as a set of p independent E-RAM sub-algorithms, each running on a processor of an HPM computer, interacting only through some data interactive points. If the time of the i-th E-RAM sub-algorithm is Ti (1 ≤ i ≤ p), then Ti = TERAMi + Tipi, where TERAMi is the computing time on a processor of the HPM system and Tipi is the time for the data interactive points. Thus, the time of this HPM algorithm will be Tp = (1/p) * Σ(i=1..p) Ti, or Tp = max(1≤i≤p) Ti. We will use the latter in this paper.
Thus, an HPM algorithm can be considered as p independent E-RAM sub-algorithms plus some data interactive points. The performance of an HPM parallel algorithm depends on two kinds of factors: the parallelism of the algorithm with the overhead of parallelization, and the overhead of the data interactive points. According to this principle, the performance of HPM algorithms is discussed in the next sub-section.
4.3 The performance of HPM algorithms
Based on the above assumptions, the performance of an HPM parallel algorithm mainly depends on two kinds of factors: the parallelism of the algorithm, and the overhead of the data interactive points, i.e., the vertical (memory) interactive points and the crossing interactive points.
In the theoretically ideal case, the computing time for an algorithm on the HPM will be Tp(t) = W / (s * p), and the theoretical speedup will be Sp(t) = p. For a real HPM algorithm, the computing time is influenced by the overhead caused by algorithm parallelization, vertical (memory) interactive points, and crossing (communication) interactive points. Thus the real computing time can be expressed as:
Tp(r) = Tp(t) + Tp(c) + Tp(k) + Tp(k+) = Tp(t) * (1 + ηp(c) + ηp(k) + ηp(k+)).
Here Tp(c) is the overhead caused by algorithm parallelization, Tp(k) is the overhead of the vertical interactive points, and Tp(k+) is the overhead of the crossing interactive points, with ηp(c) = Tp(c) / Tp(t), ηp(k) = Tp(k) / Tp(t), and ηp(k+) = Tp(k+) / Tp(t).
4.3.1 The algorithm parallelism and computing consistency
If only the computing complexity is considered, the performance of a parallel computation depends on two factors: the parallelism of the algorithm, and the complexity redundancy of the parallel algorithm A(P) compared with its sequential algorithm A. Thus, the time of the parallel computation is Tp(c) = Tp(p) + Tpc(o), where Tp(p) depends on the algorithm parallelism and the system parallelism, and Tpc(o) depends on the overhead of the computation and the complexity redundancy. Thus we have Tp(c) = Tp(t) * (ηp(P) + ηpc(o)) = Tp(t) * ηp, where ηp is the influence factor of parallel computing, ηp = ηp(P) + ηpc(o), ηp(P) = Tp(p) / Tp(t), and ηpc(o) = Tpc(o) / Tp(t).
In general, ηp(P) ≥ 0; the better the parallelism of the algorithm, the smaller the value of ηp(P). ηpc(o) is related to the computing redundancy of the algorithm. In most cases, ηpc(o) ≥ 0. However, in some special cases, ηpc(o) < 0; this may happen for certain searching algorithms, and it is one of the reasons for "super-linear" speedup. We use the term consistency to describe the computing redundancy of a parallel algorithm.
Consistency: consistency (Cpc) is the ratio of the computing complexity of a parallel algorithm (denoted Wp) to the computing complexity of its sequential partner (denoted W1).
If the value of Cpc is near 1 (more generally, if it is a constant), we say the parallel algorithm is consistent with its sequential partner.
If only the computing complexity is considered, the computing efficiency of a parallel HPM algorithm can be given as:
εP(c) = Tp(t) / (Tp(t) + Tp(c)) = 1 / (1 + ηp(P) + ηpc(o)) = 1 / (1 + ηp).
It is related only to the parallelism and the complexity redundancy of the parallel algorithm.
4.3.2 The influence of vertical (memory) interactive points
With the advent of parallel computation, the behavior of memory accesses may differ from the sequential case. The memory-access performance is mainly affected by the following factors:
1) The ratio δ of the total cache capacity to the necessary memory space of the algorithm (δ = Cc / M0) may increase because of the use of multiple processors. This leads to the possibility of a higher cache reuse rate, and thus higher memory-access performance.
2) For equation (5), tcj = (⌈nl/lj⌉ * lj) / Bj + Lj: as the bandwidth is shared and the latency accumulates over multiple processors, memory-access performance may be lower than in the sequential case.
From the above analyses, we have Tp(k) = Tp(t) * ηp(k). In most cases, ηp(k) ≥ 0. However, sometimes we may have ηp(k) < 0, for example in the special case where δ = Cc / M0 increases greatly. This is also one of the reasons for "super-linear" speedup. The parallel LU algorithm is a good example. LU decomposition for solving a set of linear equations has very good parallelism on an SMP computer. When the size of the matrix to be solved is very large, the overhead is very small, and more of the matrix data can be kept in the caches of all the CPUs of the SMP, so "super-linear" speedup may occur. We tested this on an SGI Power Challenge computer with parallelism 4; the results are given in Table 2.
Table 2. The "super-linear" speedup of the LU algorithm
n       T1      T2      Sp
256     0.18    0.077   2.34
512     1.37    0.483   2.83
768     5.40    1.660   3.25
1024    24.40   4.668   5.23

Here n is the dimension of the matrix, T1 is the time of the LU algorithm on one processor of the computer, T2 is the time of the LU algorithm on the parallel computer, and Sp = T1 / T2 is the parallel speedup. It can be seen that when n is very large, we may get "super-linear" speedup.
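The speedups in Table 2 follow directly from Sp = T1 / T2. A quick check (using the timings reported in the table; the computed ratios agree with the table up to rounding) also flags which rows exceed the machine parallelism of 4, i.e. are super-linear:

```python
def speedups(rows, p):
    """rows: (n, T1, T2) triples; returns (n, Sp, super_linear) per row."""
    out = []
    for n, t1, t2 in rows:
        sp = t1 / t2
        out.append((n, round(sp, 2), sp > p))   # super-linear if Sp > p
    return out

table2 = [(256, 0.18, 0.077), (512, 1.37, 0.483),
          (768, 5.40, 1.660), (1024, 24.40, 4.668)]
for row in speedups(table2, p=4):
    print(row)
# Only the n = 1024 run (Sp = 5.23) exceeds the parallelism 4.
```
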
4.3.3 The influence of the crossing interactive points
The effect of crossing interactive points can be expressed as:
Tp(k+) = Tp(t) * ηp(k+).
In fact, the value of ηp(k+) is related to the ratio of the overhead of the crossing points to the parallel computing complexity of the algorithm. It depends both on the characteristics of the parallel algorithm and on the implementation strategy of the algorithm on an HPM computer. The implementation strategy mainly depends on the structure of the memory hierarchy at level k+ and on the data partition strategy for the algorithm on the HPM computer. Obviously, when we solve a parallel algorithm, the characteristics of the parallel algorithm and the structure of the memory hierarchy at level k+ on the HPM computer are relatively fixed; the partition strategy is therefore critical to the performance of the algorithm.
To analyze the influence of communication interactive points, for a section of an algorithm or a complete computation (abbreviated as a section), for example a loop body or an independent section of a program, we use the following ideas.
Definition modified array
A modified array is the left-hand-side array of an assignment statement in the loop body.
Definition relevant array
The right-hand-side array B of a statement in the loop body is called a relevant array of the modified array A. If an array C is implicitly related to array B, it is called an implicit relevant array of A. If A also appears on the right-hand side, it is called self-relevant.
Definition relevant elements
The elements of a relevant array B that are related to the modification of an element A(i1, i2, …, ik) of a modified array A are called the relevant elements of A(i1, i2, …, ik). The total number of relevant elements is called the relevant number of A(i1, i2, …, ik).
Definition relevant distance
If an m-dimensional array A is a self-relevant array, then the relevant distance of an element A(i1, i2, …, im) (1 ≤ ij ≤ nj) in the j-th direction is defined by d(j)(i1, …, im) = |S − ij|, where A(i1, i2, …, S, …, im) is a relevant element of A(i1, i2, …, ij, …, im) in the j-th direction.
Theorem 1
If array A is not included in its relevant or implicit relevant arrays, then there exists a partition of array A such that there is no communication for array A when the loop body is parallelized.
However, we need to point out that the parallelization penalty is the extra memory space required for the computation. Thus, for computations with large memory requirements, this method is not practical.
Theorem 2
If the implicit relevant distance of the array in the i-th direction is equal to 0, then a partition of array A along the i-th direction may lead to parallelism without communication.
We omit the proofs here. The most general parallelization method is data partitioning, especially for arrays. For iterative computations, not only the communication behavior, but also the number of iterations and the amount of computing in every step, depend on the partition style. When communication cannot be avoided, the following principles should be used for higher communication performance.
A good partition of array A should satisfy the following conditions: 1) the ratio of computing complexity to communication is large; 2) the total amount of communication is small; 3) the total number of communication operations is small.
For simplicity and without loss of generality, we only discuss parallel sections here. The influence of crossing interactive points depends on the array partition strategy. It can be analyzed in the several cases below.
CASE 1. No data communication, but multiple copies of data may be needed.
This case happens when all levels of the loops are parallelizable and the relevant distance is zero. For example, matrix multiplication can be parallelized at any level of the loops, and the distance is zero. If the memory space is big enough, we can parallelize the algorithm along direction i of arrays A and C and copy array B to every computer; no communication is needed in this computation.
CASE 2. Data communication on the boundary.
If data communication is needed only in one direction, and some other directions can also be parallelized, then communication may be avoided: we only need to parallelize another direction and partition the array into strips. If data communication is needed in every direction, we can partition the array into strips or squares.
CASE 3. Data communication in different directions.
The computation can be divided into several (for example, two) sections in which the data dependencies, and hence the communication, occur in different directions. If we partition the arrays and parallelize the computation in a fixed direction, the communication cost may be very high; 2-D FFT and ADI computations are such examples. One possible choice for this case is to parallelize the computation and partition the arrays with a different strategy per section, using the necessary data redistribution between the sections of the computation.
CASE 4. Higher data reuse rate.
When the data reuse rate in a computation is high, the blocking method can be used for high memory performance at every hierarchy level. Linpack and blocked matrix multiplication are good examples.
CASE 5. Macro-pipeline parallel strategy.
In this case, the vertical-horizontal partition can be used for higher performance [6].
We discuss these related topics and examples in another paper [6].
4.3.4 The efficiency of parallel HPM algorithms
If all the above factors are considered, the time of a parallel computation is:
Tp(r) = Tp(t) + Tp(c) + Tp(k) + Tp(k+) = Tp(t) * (1 + ηp(c) + ηp(k) + ηp(k+)).
The real efficiency of a parallel HPM algorithm can be given as:
εP(r) = Tp(t) / Tp(r) = 1 / (1 + ηp(c) + ηp(k) + ηp(k+)).
It depends on the parallelism of the algorithm, the overhead of parallelization, the overhead of computing-complexity redundancy, and the overhead of the memory and crossing interactive points.
5 Conclusion
The parallel computing model HPM is introduced and evaluated in this paper. The model describes general homogeneous parallel computer systems with hierarchical parallelism and hierarchical memories. Some commercial parallel computers are described in terms of the HPM model. For a general HPM computer, parallel algorithms are defined and their performance is analyzed. From this discussion, we can see that the model is well suited to analyzing scientific computing algorithms.
References:
[1] Claudia Leopold, Parallel and Distributed Computing, John Wiley & Sons, 2001.
[2] C. Xavier et al., Introduction to Parallel Algorithms, John Wiley & Sons, 1998.
[3] L. G. Valiant, A bridging model for parallel computation, Communications of the ACM, 1990, 33(8): 103-111.
[4] D. Culler et al., LogP: Towards a realistic model of parallel computation, in Proc. of the ACM SIGPLAN Symp. on Principles and Practices of Parallel Programming, 1993: 1-12.
[5] B. Alpern et al., Modeling parallel computations as memory hierarchies, in Proc. Programming Models for Massively Parallel Computers, Sept. 1993.
[6] Xiangzhen Qiao et al., Algorithm optimization on the HPM model. Submitted.
[7] D. E. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, San Francisco: Morgan Kaufmann, 1999.
[8] NCIC, Cluster Dawn3000, NCIC, Beijing, 2002.
[9] N. J. Boden et al., Myrinet: A Gigabit-per-Second Local-Area Network, IEEE Micro, Vol. 15, February 1995, pp. 29-38.
[10] Gordon Bell and Jim Gray, High Performance Computing: Crays, Clusters, and Centers. What Next?, August 2001, MSR-TR-2001-76.
[11] A. V. Aho et al., The Design and Analysis of Computer Algorithms, Addison-Wesley, 1976.
[12] Xiangzhen Qiao et al., "Double-stream" analysis for the performance of parallel computations, Computer Science, Vol. 28(10), Oct. 2001, pp. 7-12.
A Visual Environment for Specifying Global Reduction Operations
Robert R. Roxas and Nikolay N. Mirenkov
Graduate School, Department of Information Systems
University of Aizu, Aizuwakamatsu City, Fukushima 965-8580, Japan
email: d8032102, [email protected]
Abstract
Parallel computing is currently held as the most efficient way of making programs to solve scientific application problems, because scientific application problems usually involve the manipulation of a very huge volume of data. This makes sequential computing not a very useful approach. However, making parallel programs is not an easy task for programmers. They need a good tool or environment in which they can easily specify what they want to solve. Because of the complexity of making parallel programs, visual programming is a very promising approach in this area. In fact, there have already been several attempts to provide a visual environment for this purpose, but they achieved only a little success. One of the probable reasons is the visual representation of the programming constructs being employed. The commonly used approach is to create or draw some graphs, which are then annotated by attaching some programming code in the conventional text-based style.

Because of the important role that parallel programming plays in scientific computing, it is necessary to provide a simpler and more effective way of specifying computational activities in a parallel program. In order to achieve this ideal goal, we take this into account in the new visual programming environment that we are developing. Our new system allows the creation of programs from algorithmic "film" specifications, where the use of text is minimal. In this system, a language of films and a language of micro-icons are used, depending on the nature of the operation to be specified. This paper discusses how to specify global reduction operations using a language of micro-icons and shows how the visual programming environment supports the manipulation of these micro-icons.
Keywords: global reduction, master/slave, parallel computation, visual environment, algorithmic film
1. Introduction
Parallel computing is currently held as the most efficient way of making programs that solve scientific application problems, because such problems usually involve the manipulation of very large volumes of data, which makes sequential computing an impractical approach. Much research has been done to devise good parallel algorithms and thereby obtain efficient parallel programs for particular domains. However, it cannot be denied that, even at present, writing parallel programs is still not an easy task for programmers. They need a good tool or environment in which they can easily specify what they want to solve. Because of the complexity of writing parallel programs, visual programming is a very promising approach in this area.
Visual programming approaches come in different forms, all of which aim to provide a language that is easy to understand and an environment where making programs (parallel or sequential) is made easy. Research in this area started more than 10 years ago, but with limited success, and its progress is seemingly slow [21]. There has been research on visualization for parallelizing programs [8], visualizing parallel and/or concurrent programs [39, 34], visualizing parallel software execution [39], animating or visualizing parallel algorithms [23, 39], visualizing the memory access behavior of parallel programs [40], parallel simulations [4], etc. The visual approach is not limited to parallel programming but applies to sequential programming as well. In fact, visual approaches are also being considered in different types of applications, such as object-oriented databases and programming [6, 14], SQL databases [5], XML queries [10], multimedia applications [12, 19], rule-based programming [28], etc.
Despite the efforts that have been exerted in visualizing the different aspects of parallel computation, it is evident that visual programming, especially for making parallel programs, has not fully matured yet. One probable reason is the visual representation of the programming constructs employed. The common approach is to visualize computation using graphs [20, 2, 27]. In making programs, the programmer usually creates or draws graphs, which are then annotated by attaching programming code in the conventional text-based style. This may be the reason for using the term graphical programming [21, 2, 25] instead of visual programming. In the graphical approach, the nodes represent code that performs some computation and the arcs represent the flow of data and/or computation. In some existing languages, the arcs are called wires [21, 42]. The use of graphs is useful and easy to understand in some situations, especially for small programs, but when a program gets large, difficulty arises because of the limitation imposed by the size of the computer screen.
Because of the important role that parallel programming plays in scientific computing, a simpler and more effective way of specifying the computational activities of a parallel program is needed. We take this into consideration in the new visual programming environment that we are developing. This new system allows the creation of programs from algorithmic “film” specifications [22, 41, 44, 45, 9], where the use of text is minimal. Although this system also employs visual programming, it does not use graphs as the primary means of specifying and viewing the different parts of a program. Algorithmic films of a multiple-view format are used, in which the different parts can be specified and/or viewed in a non-linear order according to the user's demands. In this new programming technique based on such films, a programmer can browse, watch, edit, and specify or attach operations to film frames inside some panels. When creating a program, he simply combines films from a database suitable to his needs and updates the operations within the chosen films. The source code is generated automatically by the system using templates from a database [9]. In generating the source code, the C language (for non-MPI-related commands) and the C version of MPI (for MPI-related commands) are used. In this visual environment, six subsystems are used to implement the six different groups of frames related to the different views that the system offers to a programmer. The goal of these different views is to allow a programmer to understand his programs better from different points of view. In the past, the idea of using animation in making programs was restrictive and reduced to straightforward and often boring visualization of some intermediate results of computation; see [39].
One of the six subsystems in this visual environment is a visual or graphical interface for specifying I/O operations for software components and communication between components. The specification of communication among processes in a parallel program is just a partial case of the latter. Thus, we do not show or discuss the main visual environment here; interested readers can find information about it in [22, 44]. What we present here are the specification panels related to specifying global reduction operations within this visual environment. Because these panels are displayed and manipulated within the main visual environment, we consider them the visual environment for specifying global reduction operations. In general, its objective is to provide a tool with which programmers do not have to memorize a lot of things in order to use the system. This is one of the factors why we deem visualization an important programming tool. Some may say that real programmers do not really need to memorize programming commands, like MPI global reduction operations, because they can easily consult the users' manual. This may be true for some people but not for others. As can be observed, a lot of computer users shifted from the DOS environment to the Windows environment, and there must be reasons for this phenomenon. In the DOS environment, one has to know a lot of commands, with their parameters and switches, to use the operating system effectively; these things are not necessary in the Windows environment. Even for the Unix system, many people have developed graphical environments to enhance the pure text-based system, and many applications running on Unix systems are graphical. If a visual or graphical environment is considered useful for computer users, then a visual programming environment should be very useful for programmers, especially since making parallel programs is a difficult task.
To illustrate the features of this subsystem, the specification of global reduction operations based on the master/slave scheme of computation is presented. Any of these operations is attached to a master/slave film using a language of micro-icons. Our approach is not limited to global reduction operations; other types of collective communication operations are also being considered and can be found in [30]. We are trying to provide micro-icons that are intuitive enough to be understood even by a novice programmer. The number of micro-icons can grow as new features are added to the system, because they form an open set. When there are too many micro-icons in the system, a programmer may need some aid in understanding them. Thus, a system help facility that comes in three forms, animation, sound, and text, is included in the system.
In this paper, no novel algorithms are introduced to improve the current state of problem-solving, especially for scientific computing. Nor does it present any experiments with existing systems. What is described here is just a part of a visual programming environment that aims at making programming, especially for scientific computing, a bit easier. What is presented is an approach that aims at providing visual support for the existing MPI library, and the examples used are limited to the specification of global reduction operations only. In Section 2, some works related to this research are described briefly. These works have influenced this research and formed the basis for a quest for a new visual programming paradigm and language. In Section 3, an overview of the visual programming environment called “Active Knowledge Studio” (AKS) [22, 44, 45, 41, 9] is given. Then the specification of the different global reduction operations is illustrated in Section 4. In Section 5, source code generation is discussed. Finally, this paper is concluded in Section 6 with statements about the future direction of this work.
2. Related works
Much research has been done to provide environments in which making (sequential or parallel) programs is made easy. Even at present, many researchers are still studying how to make programs from visual symbols in order to facilitate software construction. These people believe that such an approach will make the construction of programs easy. This should be considered an ideal goal, especially for parallel programming, because of the complexity of making parallel programs. The following paragraphs briefly describe some of the systems that use visualization as a technique to make parallel programs easy to create and their computational activities easy to understand. In general, the common technique they use is graphs, where nodes signify computations and arcs signify communication paths. One can consider this technique partial in the sense that the graphs are used only to visualize the flow of the computational process. In reality, it is just the outward appearance of a program that they visualize. The programmer still goes back to text-based programming, usually using the syntax (or a close variant) of the programming language used to run the program.
HeNCE [2] is a graphical parallel programming environment intended to simplify the construction of parallel programs. In fact, the authors claim that it is a tool that greatly simplifies the writing of parallel programs. When making parallel programs, the programmer draws directed graphs in which the nodes represent calls to conventional sequential subroutines written in C or Fortran and the arcs represent execution ordering constraints. The program is then automatically translated into an executable parallel program that runs under the PVM [11] system. The graphical aspect of the system is quite easy to use, but the creation of the actual code to be run by a node is quite difficult. This is because the programmer has to use a conventional programming language again, like C or Fortran, and the bulk of the work is in specifying the program textually. Thus, one can consider this system quite difficult to use and possibly prone to syntactical and/or typographical errors. It would therefore be ideal if the conventional programming style could be done away with, or at least minimized if it cannot be eliminated.
Another graphical parallel programming environment is the CODE [27, 25] system. The purpose of this system is to provide an environment in which users can create parallel programs by drawing and then annotating a picture. So programming is reduced to drawing pictures and then declaratively annotating them. Depending on the kind of information needed to annotate a node, annotating it can be either simple or not so simple. Setting the different attributes of a node is done by filling in a form with some text. This is simple, but to specify the code of a node, code whose syntax resembles that of the C language must be input from the keyboard with the help of a text editor. Although its syntax is like C, there are additional symbols and their corresponding semantics to memorize. Certainly, this is one step higher than a pure text-based programming style.
Paralex [1] is described as a complete programming environment that makes extensive use of graphics to define, edit, execute, and debug parallel scientific applications. It is a visual environment designed for making parallel programs in distributed systems, and it aims to explore the extent to which the parallel application programmer can be liberated from the complexities of distributed systems. A Paralex program is composed of nodes and links, where nodes correspond to computation (functions, procedures, programs) and links indicate the flow of (typed) data. Although one of the components of Paralex is a graphics editor for program development, the programmer must still supply the code of a particular node of interest. A text editor is still used to create or modify the source file or the include file of a given node, so the programmer has to know the syntax of a particular language like C, C++, Fortran, Pascal, Modula, or Lisp. Thus, it is just a step higher than a pure text-based programming style, which may also be prone to typographical and/or syntactical errors and may require memorization.
Visper [38, 37] is a graph-based software visualization tool intended for making parallel message-passing programs. It uses a formalism that combines control-flow and data-flow graphs into a single visual formalism, called the Process Control Graph (PCG) [36]. Visper has a user-friendly PCG visual editor for creating programs. When creating a program, a programmer draws graphs that show parallel programs or modules. These graphs are then annotated by filling in a set of forms, e.g., an ExecuteForm for execute blocks. Execute blocks are considered the basic building blocks for assembling the code into a program. They are used to add code to a PCG and to describe how sequential components are composed into a parallel construct. The created graphs can be compiled to automatically compose the program. If there are no syntax errors, a text file of the program is saved, ready for compilation with C or Fortran. This is quite close to what is presented in this paper in the sense that it tries to minimize typing or programming with text, as in the case of filling in a NodeForm. However, in some cases, like defining sequential computation to be executed by the (all or exclusive) execute blocks, the code is entered as text, just like writing modules in a conventional programming language. Thus, the authors themselves state that programming in PCG is not visual in all aspects [36].
Another visual parallel programming environment for message-passing parallel computing is VPE [26]. It is intended to provide a simple human interface for creating message-passing programs. When making programs, the programmer specifies the parallel structure of his programs by drawing pictures. VPE computations are represented as graphs where nodes represent processes and arcs represent the paths along which messages flow from process to process. Each graph contains computation (comp) nodes, which are annotated with program text, expressed in C or Fortran, that contains simple message-passing calls. Although this system provides a visual environment for the creation of parallel programs, it is still not visual in all aspects. The programmer still needs to write code in a text-based approach and has to memorize some syntax to be able to use the language. Thus, it is still quite difficult to use and may be prone to typographical and/or syntactical errors as well.
GRAPNEL [15] is a visual programming language designed for creating distributed parallel programs based on the message-passing programming paradigm. It is considered a hybrid language that uses both graphical and textual representations to describe the whole distributed application. Graphics are used to show those program parts which are important with respect to parallelism. Textual descriptions, written in C or Fortran, are used to specify program elements such as data declarations and definitions, code segments without any communication, etc. The technique is quite good, but there are still many things to memorize: not only the different graphical symbols but also the textual syntax associated with or defined for a particular graphic symbol. Thus, it is only a partially user-oriented system.
Of course, there are other systems that also provide a visual environment for creating parallel programs, but they are normally graph-based variants of those described above. Other related visual systems for making parallel programs include DIVA [29], Meander [43], and VADE [23], to name a few. Just like the ones mentioned above, these systems do not fully provide an environment where programming can be done without resorting to the text-based programming style, or where the use of text in constructing parallel programs is minimized. In fact, text-based programming plays a major part in the development of software in these systems.
Our visual programming environment uses algorithmic films instead of graphs as the major components of a program. An overview of this system is given in the next section. In general, the use of text is minimized and the use of micro-icons in programming is maximized. The system also aims to eliminate memorization in making programs by providing intuitive micro-icons. If the intuitiveness of the micro-icons is not enough, the system supplies simple-to-understand system help in the form of animation, sound, and text when necessary.
3. An overview of our system called Active Knowledge Studio
“Active Knowledge Studio” (AKS) [22, 44, 41, 45] is our on-going research in the area of “self-explanatory component” technology. In this system, six different subsystems are being considered in order to implement the six different views related to the different features of a software component. These features concern: 1) the computational schemes used (Multimedia & Skeleton group), 2) the variables and formulas used (Variables & Formulas group), 3) I/O operations and/or communication (Input/Output group), 4) the integrated view of the algorithm (Integrated View group), 5) auxiliary information (Registration group), and 6) links to other pieces of information (Links & History group). Appropriate panels will be provided for each of these groups.
In the AKS visual environment, a programmer creates a program simply by combining algorithmic films from a database. In general, he does not create a film from scratch. He can browse the films provided by the system and view the desired groups of features. If a film is appropriate for his needs, he can edit it by specifying its size, attaching variables to it, defining computational operations, specifying I/O operations and/or communication, etc. All these activities are done within the visual environment. He can combine several films to meet his computational needs. A program may therefore consist of one or more films, depending on its complexity. A film may consist of one or more scenes for achieving some tasks. A scene is a set of stills or frames performing related tasks. Each still is a single step of a computation and is the smallest item in a film where the programmer specifies or defines an operation.
A typical operation in a master/slave scheme of computation can be used to illustrate when to group stills into a scene. The slave process usually performs at least three things: it receives data from the master process, performs some computation, and sends the results of the computation back to the master process. These three activities can be considered as three scenes. Scene 1 is used for defining the operation of receiving data, scene 2 for specifying some computation, and scene 3 for specifying the send operation. Depending on the complexity of the operation in each scene and the algorithm used, a scene may consist of one or more stills. For instance, if a programmer wants to specify a matrix multiplication using the master/slave scheme, each slave process must first receive a row from the first matrix and the entire second matrix before doing any computation. Since these activities are all receive operations, they can be considered one scene with at least two stills, i.e., still 1 for receiving a row from the first matrix and still 2 for receiving the entire second matrix.
In the filmification approach, operations are defined in a still. There are many kinds of operations that may need to be specified in a still. These operations can be the specification of formulas for solving a fragment of the entire application, an input operation for initializing the data, an output operation after computation, communication with other processes in a parallel environment, etc. For example, under the Multimedia and Skeleton group, a programmer can specify certain types of node flashing in a still. He can also select only some nodes or all of the nodes of a still if necessary; see [22, 44, 45]. This paper presents some examples of attaching global reduction operations to a still of a film. These operations fall under the Input/Output group. A language of micro-icons is used to accomplish the task of attaching this kind of operation to a still. For specifying any of these global operations, the system provides a panel designed for that particular kind of global reduction operation. The specification panel contains buttons with example micro-icons that give the programmer an idea of the function behind the micro-icons. The programmer simply clicks on any of the buttons that he thinks needs to be changed.
One may wonder whether it is possible to specify computation without using text at all. So far, it is impossible, but the use of text can be minimized. Text is still used in this area, but its use is minimal, and a special kind of text, considered to be in “enhanced” format, is used. Details about this kind of text are found in [22, 44, 41].
4. Specifying global reduction type of operations
This paper presents how to specify the global reduction types of operation in the AKS visual environment. The master/slave scheme of computation is chosen as the model for illustrating the examples. In the examples, it is assumed that the programmer has already selected a master/slave film from the Multimedia and Skeleton group and wants to specify a global reduction operation. He can navigate among the scenes by clicking on the micro-icon for a particular scene of interest in the display of scenes. The corresponding still(s) of the chosen scene will be displayed inside the area for displaying stills. The programmer must switch to the appropriate still before specifying an operation; the operation defined in a still applies to the selected nodes only. The subsections below illustrate how to specify the different types of global reduction operations on a still of a film. The desired still should be selected first; otherwise, the programmer may specify an operation on the wrong still. The attached operation is always for the current still.
All specifications discussed in the subsections below are viewable in other modes, like animation and tile modes, similar to the examples discussed in [31]. The tile mode allows the programmer to view his set of specifications several frames at a time; this can be considered as viewing dynamics in a static way. Other similar panels are discussed in [32, 33], which were modified and included in [31]. However, all of them are used to visually specify the input/output operations of a software component, whereas the examples discussed below are used for visually specifying the global reduction operations of a parallel program using the message-passing paradigm.
4.1. Reduce operation specification
Figure 1 shows an example specification panel for defining a reduce operation. The panel is supplied with an example micro-icon for each button with a replaceable micro-icon. This makes the operation specification both easy to use and easy to understand. It is easy to use because the programmer has a visual guide at hand: the example micro-icons serve as visual guides. It is easy to understand because the example micro-icons are intuitive enough to convey the meaning of the operation they depict. To enhance the intuitiveness of a micro-icon, text is also used. The convention for this text is to use an italic font when it is part of a micro-icon but a normal font when it is used as a variable name. For example, in Figure 1, “BARRIER” is italicized because it is part of the micro-icon, while “SBUFF” is in a normal font because it is used as the name of a variable that contains some data. Moreover, the programmer can easily obtain system help if he is unsure or has some doubt about the meaning of a micro-icon. This system help is discussed later in this subsection. In general, the example micro-icons are more or less sufficient, so only simple fine-tuning is necessary.
The appearance of the panel generally varies according to the number of buttons necessary to specify the needed operation. But for the four global reduction operations discussed in this paper, all of the panels have the same appearance. For other types of collective communication, like scatter and gather, the panels are slightly different but use a similar approach [30]. Figure 1 contains three areas, the center of which is for specifying the main operations.
Figure 1. The reduce operation specification panel.
The first area (left) contains a node indicator signifying the type of nodes [44, 45] for which the operation is defined. As shown in the figure, the node indicator is a micro-icon made up of a set of simple squares. It means that the definition contained in the panel is a specification for an activity defined for a group of nodes. This is natural for any global reduction operation, because the operation is intended for a group of participating processes. The color of the node indicator depends on the color of the node(s) to which the given definition is attached. The programmer must select a node or group of nodes in a still of a film to which a particular operation will be attached. The chosen node(s) will be shown with a certain color, depending on the color chosen by the programmer. During the execution of the program, these chosen nodes will be flashed in some way depending on the activity of a given still. For a discussion of the different kinds of flashing nodes and their functions, see [9, 45]. For a definition attached to a single node, a different micro-icon is used; in that case, the micro-icon shows a single square rather than a set of squares. It should be noted that there are two shapes of nodes that the system uses: an ordinary node is represented as a circle, while a node that calls some functions or films is represented by a square. The system shows this node indicator automatically, based on the available information such as the type and color of the nodes.
The second area (center) of the specification panel is used to define the operation itself. This part is called the Main Definition Area (MDA). This area contains two non-replaceable micro-icons, i.e., the micro-icons with a white background, and two sets of buttons with replaceable micro-icons, i.e., the buttons with a gray background. The MDA of the example shown in Figure 1 contains two non-replaceable micro-icons and six buttons with replaceable micro-icons. The system automatically supplies the two non-replaceable buttons. The top one shows the type of operation being used. In this example, the micro-icon represents a global reduction operation. It portrays a configuration of a set of processes: as indicated by the arrow, the left part shows the initial configuration and the right part the final configuration after the execution of the operation. Processes are depicted as small boxes aligned vertically. Processes involved in a given operation are colored black, while those that do not participate are colored white. As can be seen in the example, the initial configuration shows all of the small boxes in black; these black boxes represent the processes that perform an operation resulting in the configuration shown on the right side of the micro-icon. The example micro-icon also shows that only one process is black after the operation. This means that only one of them is involved in the final step of the operation, i.e., receiving the result of the computation. From the small boxes on the left, lines are connected to a small white box at the center, representing the result of the operation that will be sent to the single black process on the right side of the micro-icon. Each line carries a small geometrical shape, distinct from the others, because in a global reduction operation different processes hold different, or heterogeneous, data. The heterogeneity here refers to the values, not to the type of the data.
Figure 2 shows the different micro-icons for the different types of collective communication using the message-passing paradigm, which of course include the global reduction operations. The four micro-icons at the top (first row), from left to right, are for the broadcast, scatter, gather, and alltoall types of communication. The four micro-icons in the middle (second row), from left to right, are for the allgather, reduce, allreduce, and reduce-scatter types of
Figure 2. Micro-icons for message-passing operations.
communication. The four micro-icons at the bottom (third row), from left to right, are for the exclusive-scan, inclusive-scan, left-shift-exchange, and right-shift-exchange types of communication.
The other non-replaceable micro-icon shows the type of structure being used. The system also sets this button automatically, because it already knows the type of structure for which the operation is being defined. The example in Figure 1 shows a master/slave scheme. The master process is represented as a gray box with connections to the different slaves, depicted as small black boxes. It also has connections to the data elements, which are shown at the upper part of the micro-icon. Of course, the presence of this micro-icon is not very important in the specification panel in edit mode, because the programmer can see the current structure in the display of stills. The edit mode view is just one of the views supported by the system; animation and tile views are supported as well, to help the programmer view his set of specifications several frames at a time. When using these other views, the programmer may not see the current structure under consideration. Thus, the presence of this micro-icon, signifying the structure to which the specified operation is attached, is important to help him easily recognize the structure being used, especially if there are many types of structures in a complex system. To keep what the programmer sees in edit mode consistent with the other modes, the structure is also included in the specification panel in edit mode. These other modes and their functions are discussed in [31].
The next discussion concerns the first four buttons with replaceable micro-icons. These micro-icons belong to the Source Definition (SD) group. The upper left button contains a micro-icon showing the name “SBUFF” (send buffer), whose structure is a 2-D array containing double-precision floating-point numbers. If the programmer wants to change the given example micro-icon, he can edit it, e.g., to another dimension, structure, or data type. It is assumed that such a data structure has already been defined prior to the specification of the operation. It should be pointed out that in MPI [24, 7] the buffers are usually 1-D arrays, but in real applications, other dimensions are also used and manipulated. To allow the use of multi-dimensional arrays and to comply with the requirements of MPI, any higher-dimensional data will be saved into a 1-D array before calling an MPI function. Another solution would be to declare a new data type, because MPI supports this. The upper right button of SD contains a micro-icon showing the type of operator used in this example operation. The micro-icon shows the commonly used symbol for a summation operation. Other types of operators, including user-defined ones, can be specified in this button, but the example is limited to the summation operator.
The next button, found at the lower left side of SD, contains the text label "INFO." It simplifies the specification panel for a message-passing operation. In MPI [24, 7] collective communication such as scatter, gather, alltoall, and allgather, the programmer has to type the count and the type of the data to be sent twice: once at the sender side and once at the receiver side. The rationale is to make sure that the amount of data sent equals the amount of data received, and that both sides use the same data type. Since these pieces of information are declared twice, it is better to show them just once in the specification panel and supply them automatically in the source code; in fact, in most cases they are significant only at the root process. Global reduction operations differ slightly from the collective communications mentioned above, but for the sake of uniformity the same technique is applied when specifying any global reduction operation. The button with the text label "INFO" names the structure that will hold these pieces of information, which include at least the type of the data and its size or count; these can be set by the system automatically. The structure also records whether a single process, called the root, or the members of a certain group are involved in the specified operation. In the case of a root process, it also stores the root's rank; in the case of a group, it also stores the identifier of the group communicator. Additional pieces of information, such as the ones discussed in [3, 16] for ensuring the causal order of messages, can be included depending on the implementation. The system can set most of these pieces of information automatically; combining them simplifies the design of the I/O specification panel in AKS.
The lower right button of SD specifies a barrier synchronization. The micro-icon shows an image of some synchronizing processes before (to the left of) a clock image symbolizing time. Since the processes are to the left of the clock image, a barrier synchronization must take place before the operation defined in the specification panel is executed. It can be changed to the kind of barrier synchronization where the clock image is to the left of the synchronizing processes; in that case, a call to a barrier synchronization takes place after the specification defined in the panel, e.g., the reduce operation, has been executed. The system is then instructed to synchronize all processes so that the slave processes are free and ready to perform further data manipulation or computation once the given specification has been executed.
Figure 3 shows the different barrier synchronization options that can be used. The top row (from left to right) consists of: a barrier that waits for a single process, a barrier that waits for some processes, a barrier that waits for all processes, and a case where the processes wait for some events. As depicted in the first three micro-icons from the left, the synchronized processes are found before (to the left of) the clock image. This means that a barrier synchronization must be called by the processes before proceeding to the operation specified in the specification panel. The first three micro-icons in the bottom row of Figure 3 are almost the same as the corresponding micro-icons in the top row; the difference lies only in the placement of the clock image. There, the synchronized processes are found after (to the right of) the clock image, which means that synchronization must take place after the operation specified in the specification panel has been executed. The rightmost micro-icon at the bottom of Figure 3 specifies no synchronization. A barrier micro-icon uses a vertical line to represent the synchronization point. The small white circle represents a waiting process; small black circles to the left of the synchronization line signify processes that have not yet reached the synchronization point, and small black circles to the right of the vertical line signify processes that have passed it. Other types of barrier can be used in different situations. One possibility is to wait for a particular process, symbolized by a single black circle to the left of the synchronization line. Another is a barrier that waits for only a certain number of processes, symbolized by more than one small black circle to the left of the synchronization line together with a black circle to its right.
Figure 3. Synchronization options.
The next discussion is about the Target Definition (TD) group of buttons in the specification panel. For a reduce operation, what needs to be specified is the receiving buffer and the process to which the data will be sent. The button at the top shows the name of the receiving buffer, in the example "RBUFF," together with its structure and data type. In MPI [24, 7], the receive buffer is required to be of the same size as the send buffer, and it is significant only at the root. In the example, the same structure is used in both the upper left button of SD (send) and the upper button of TD (receive); the data type and the size of the structure must likewise match those of the sender's buffer. The appearance of the 2-D structure in this micro-icon is only a visual depiction: the 2-D data is actually converted to a 1-D array behind the scenes. The button at the bottom of TD specifies the root process to which the resulting data will be sent after the operation. It contains the text label "ROOT" as the variable identifier naming the receiving process.
As shown in Figure 1, there is no explicit way to specify the type and amount of data to be received. These pieces of information are accessible from the data behind the definition of the button with the text label "INFO" found at the lower left part of the SD group. The size of the structure and the type of data it contains are automatically saved in the "INFO" structure. Other pieces of information included in this "INFO" structure are an indicator that the participants are members of a group and the identifier of the group communicator. This is because in a reduce operation all members of the group participate, and only the process with the rank of the root receives the result.
The third area (right) of the specification panel contains two functional buttons that are not grouped together. The top button contains the text label "Edit." When this button is clicked, the label changes to "Help"; hence it is called the "Edit/Help" button. When the "Edit" label is shown, the system is in the "Edit" mode. In this mode, clicking any button with a replaceable micro-icon in the SD or TD group shows options with which the programmer can change the example specification. When the "Help" label is shown, clicking any button in the MDA, including the non-replaceable micro-icons, shows the system help in the context of the clicked micro-icon. This is done so that the programmer can get information even for trivial things. The system help is handy and easy to understand and use; it comes in three forms: animation, sound, and text. The micro-icons themselves are intended to be intuitive, but if the programmer wants to make sure of a micro-icon's meaning or function, he can invoke the system help easily. The help text is rather short because it is limited to the meaning of the clicked micro-icon. This aims at making programming easy because no memorization is required, and it is one small part of the broader goal of providing self-explanatory components.
The button below the "Edit/Help" button is used to record the specification found in the panel. It is called the "Set/Reset" button because its label toggles between "Set" and "Reset" when clicked. When the system is in the "Set" mode, clicking this button means the programmer wants to save his specification, and the corresponding source code will be created automatically; the aim is to make programming easy without reverting to a text-based programming style. When the button is in the "Reset" mode, all buttons in the SD and TD groups with replaceable micro-icons are inactive and cannot be replaced, meaning the "Edit" mode has been disabled. The "Help" mode, of course, remains enabled, and it is still possible to view animations, hear sound, or read text.
Figure 4 shows the corresponding source code for the specification shown in Figure 1. It contains declarations of local variables; some code portions, such as the MPI initialization and finalization calls, are not shown. The 2-dimensional structure is saved to a 1-D structure by calling an extra function, Convert2Dto1D(). It is also possible to create a new MPI datatype instead, since MPI provides functions for that. After the data has been converted, it is passed to the MPI_Reduce() call. Since the specification in Figure 1 defines a barrier synchronization that must synchronize all processes before the reduce operation is performed, the call to the barrier synchronization is issued first. If Figure 1 had defined a barrier synchronization with the image of the synchronizing processes after (to the right of) the clock image, the barrier synchronization would be called after the reduce operation; if no synchronization were specified, there would be no call to a barrier synchronization at all. As for the parameters of the MPI_Reduce() call, the number of elements in the send buffer and their data type are retrieved from the structure named "INFO." The same holds for the "COMM" communicator, which indicates the group of processes participating in the desired operation. These pieces of information are not included in the specification panel shown in Figure 1, for the reason discussed above. The values of the "INFO" structure are set before the source data is converted from a 2-D array to a 1-D array; this location is indicated by the ellipsis before the call to Convert2Dto1D().
4.2. AllReduce operation specification
Another type of global reduction operation is the allreduce operation. It is basically the same as the reduce operation except that, instead of one process (the root) receiving the result, all member processes in the group receive the same result. In this case, there is no need to specify a root process. Figure 5 shows an example specification for the allreduce operation. Its appearance is more or less the same as that of Figure 1, except for a few micro-icons found in the panel.
The discussion here focuses on the Main Definition Area (MDA) only, because the other parts of the specification panel have the same functions as discussed above. The top non-replaceable micro-icon in Figure 5 differs from that of Figure 1 because a different type of global operation is being specified. As shown in Figure 5, the final configuration, represented as small boxes arranged vertically at the right side of the micro-icon, shows all processes colored black. Besides this indicator, lines connect the result, shown at the center of the icon, to each of these small black boxes. This means that all processes receive the result of the computation defined in the specification panel. The lower right button specifies a barrier synchronization. This time the barrier is called after the specification found in the panel has been executed, as indicated by the position of the image of the synchronizing processes: it is placed after (to the right of) the image of the clock. Compare this with the one specified in Figure 1. The next discussion focuses on the Target Definition (TD) group of the MDA. The button at the bottom specifies the group of processes that will receive the result of the allreduce operation: instead of the root process, a communicator is specified as the receiver of the result.

...
MPI_Comm COMM;
int X_dim, Y_dim;
int gsize, ROOT, myrank;
double *SBUFF, *RBUFF;
...
MPI_Comm_size(COMM, &gsize);
MPI_Comm_rank(COMM, &myrank);
SBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
RBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
...
convert2Dto1D(&sbuff, X_dim, Y_dim, &SBUFF);
MPI_Barrier(COMM);
MPI_Reduce(SBUFF, RBUFF, INFO->count, INFO->type, MPI_SUM, ROOT, INFO->COMM);
...

Figure 4. The reduce operation source code.

Figure 5. The AllReduce operation specification panel.
Figure 6 shows the corresponding source code for the specification shown in Figure 5. This source code is very similar to the one shown in Figure 4; in fact, there are only two noticeable differences. One is the placement of the call to MPI_Barrier(), which now follows the call to MPI_Allreduce(), because a different kind of barrier synchronization is specified in Figure 5. The other is the parameter list of the call to MPI_Allreduce(), which does not include "ROOT."
4.3. Reduce-scatter operation specification
The next global reduction operation to be discussed is the reduce-scatter operation. It is similar to the two global reduction operations discussed in the previous subsections; the difference lies in how it handles the result of the computation. Instead of sending the result to a single process (reduce) or the same result to all processes (allreduce), it divides the result and scatters the different parts to all member processes of a certain group. Figure 7 shows the specification panel for a reduce-scatter global operation. Again, the discussion concentrates on the Main Definition Area (MDA), because this is where the panel differs from the other panels shown so far.

...
MPI_Comm COMM;
int X_dim, Y_dim;
int gsize, ROOT, myrank;
double *SBUFF, *RBUFF;
...
MPI_Comm_size(COMM, &gsize);
MPI_Comm_rank(COMM, &myrank);
SBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
RBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
...
convert2Dto1D(&sbuff, X_dim, Y_dim, &SBUFF);
MPI_Allreduce(SBUFF, RBUFF, INFO->count, INFO->type, MPI_SUM, COMM);
MPI_Barrier(COMM);
...

Figure 6. The AllReduce operation source code.
Figure 7. The Reduce-Scatter operation specification panel.
The top non-replaceable micro-icon is the one automatically displayed once the reduce-scatter type of operation is chosen. This button is very similar to its counterpart shown in Figure 5. The first noticeable difference is the portrayal of the results of the computation: instead of a single small white box representing a single result, a set of small white boxes arranged vertically is used. Each box in this set represents the data that will be sent to one process, so there are as many small boxes as there are member processes in the group. Another difference is the small geometrical shapes attached to each line connecting the result to the receiving processes. Instead of white squares on all lines, a distinct shape with a distinct color is used for each line. This shows that the result of the computation is scattered rather than broadcast: in scattering, distinct data is sent to each member process of a group, while in broadcasting, the same data is sent to all member processes.
The next discussion is about the Source Definition (SD) group of the MDA. Comparing the specification panel in Figure 7 with those in Figures 1 and 5, one can observe just one obvious difference: the micro-icon in the lower right button of SD. In this example specification, no barrier synchronization is used, as indicated by the text "NONE" and the absence of the image of the synchronizing processes; only the image of a clock remains. This illustrates how to specify no barrier synchronization. It should be noted that this button initially holds an example barrier synchronization by default; the programmer can change the default micro-icon by clicking the button, whereupon the system offers the possible micro-icons shown in Figure 3 for him to choose from. In the Target Definition (TD) group, the top button contains the micro-icon for the receive buffer. As can be seen in the specification panel, it shows just a slice of a 2-D structure. In Figures 1 and 5 the receive buffers are both 2-D structures, but in Figure 7 it is a 1-D structure: since the result of this operation is split into n parts, the receive buffer is visualized as smaller than the corresponding send buffer.
Figure 8 shows the corresponding source code that will be generated by the system. It is very similar to the previous example source codes except for a few things. As can be observed, there is no call to MPI_Barrier(); this is expected because no synchronization barrier was specified in Figure 7. Another difference is the definition of the receive buffer "RBUFF," which is a 1-D structure instead of a 2-D structure.
...
MPI_Comm COMM;
int X_dim, Y_dim;
int gsize, ROOT, myrank;
double *SBUFF, *RBUFF;
...
MPI_Comm_size(COMM, &gsize);
MPI_Comm_rank(COMM, &myrank);
SBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
RBUFF = (double *) malloc(X_dim*sizeof(double));
...
convert2Dto1D(&sbuff, X_dim, Y_dim, &SBUFF);
MPI_Reduce_scatter(SBUFF, RBUFF, INFO->count, INFO->type, MPI_SUM, COMM);
...
Figure 8. The Reduce-Scatter operation source code.
4.4. Scan operation specification
The scan operation is a variant of the three global reduction operations discussed above. It is often used in parallel prefix computation, as mentioned in [17, 35, 13, 18]; in this paper it is also called the prefix sum operation. The specification panel for this operation is shown in Figure 9. Comparing it with Figure 5, only two buttons have different micro-icons: the top non-replaceable micro-icon and the lower right button of the Source Definition (SD) group.
The top non-replaceable micro-icon shows the type of operation being defined in the given specification panel. Among the four micro-icons used to represent the different types of global operation discussed and illustrated in this paper, the one used in Figure 9 is quite different: it depicts the accumulation of data. Viewing the micro-icon from top to bottom, where the topmost level represents the lowest-numbered process, one can see the accumulation. Process 0 stores in its receive buffer simply the data from its own send buffer. Process 1 combines its data with that of process 0 using the specified operator and stores the result in its receive buffer. Process 2 combines its data with the data from processes 0 and 1 and stores the result in its receive buffer. In general, process i combines its data with the data from processes 0 through i-1 using the specified operator.
Figure 10 shows the corresponding source code for the specification defined in Figure 9. It is very similar to the one shown in Figure 6 except for the last two lines, which differ in the placement of the call to MPI_Barrier() and, of course, in the type of global reduction call and its parameters.
Figure 9. The Inclusive Scan operation specification panel.
...
MPI_Comm COMM;
int X_dim, Y_dim;
int gsize, ROOT, myrank;
double *SBUFF, *RBUFF;
...
MPI_Comm_size(COMM, &gsize);
MPI_Comm_rank(COMM, &myrank);
SBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
RBUFF = (double *) malloc(X_dim*Y_dim*sizeof(double));
...
convert2Dto1D(&sbuff, X_dim, Y_dim, &SBUFF);
MPI_Barrier(COMM);
MPI_Scan(SBUFF, RBUFF, INFO->count, INFO->type, MPI_SUM, COMM);
...
Figure 10. The Scan operation source code.
5. Source code generation
Source code is generated automatically by the system using templates from a database. As mentioned in Section 3, the system under consideration contains six subsystems implementing the six different groups of frames being considered. Each of the six subsystems has its own set of templates, and the code generation subsystem collects the source codes generated by these different subsystems to form a whole program. While collecting source codes from the different subsystems, the code generation subsystem performs some consistency checking; if no conflicts are found, it proceeds with the generation of an executable.
This section describes only the source code generation part of the subsystem for specifying I/O operations and communication among processes in a parallel program. After specifying a particular kind of operation by replacing some of the example micro-icons in the specification panel, the programmer can record his set of specifications using the "Set" button provided in every specification panel. During this operation, the system also performs some simple consistency checking to make sure there will be no conflicting definitions. If there are none, the subsystem proceeds to generate source code for the given specification from a template database. The generated source code is saved for the code generation subsystem, which combines it with the code generated from specifications made under the other subsystems to make a complete program. The complete program is then compiled to produce an executable of the entire program. The programmer does not have to go back to a text-based programming style and does not need to memorize the syntax and semantics of a language, as he would in a text-based programming environment. At present, the code generation subsystem is not yet fully operational, but its implementation is underway, so no full-fledged results are presented here. For the message-passing type of communication, the existing C version of the MPI [24, 7] library is used. For more information about the code generation subsystem of "Active Knowledge Studio," please see [9].
6. Conclusion and future work
A visual tool for specifying global reduction operations under the message-passing paradigm has been presented. The master/slave scheme of parallel computation was used as the model to illustrate the specification of these operations. No novel algorithms were introduced to improve the current state of problem-solving, especially for scientific computing, and no experiments with existing systems were presented. What was described is part of a visual programming environment aimed at making programming, especially for scientific computing, a bit easier: an approach that provides visual support for the existing MPI library. An overview of a different technique for making parallel programs using algorithmic "films" was also discussed. In this technique, a programmer can navigate the films either by scenes or by stills, and can manipulate the stills of a film inside a specification panel by attaching operations using a language of micro-icons. Every panel is supplied with intuitive example micro-icons that serve as guides to the programmer; he performs simple fine-tuning of these examples to specify any of the global reduction operations. Moreover, if the intuitive meaning of the micro-icons is not enough, the system provides simple-to-understand help in three forms: animation, sound, and text.
What we have presented is just a part of our on-going project called "Active Knowledge Studio." Much remains to be done. Once the entire system is fully operational, experiments will be conducted to evaluate its effectiveness as a tool, to determine whether it can be used to build real, complex scientific applications, and to validate whether programmers really find the environment easy to use and understand.
References
[1] O. Babaoglu, L. Alvisi, A. Amoroso, R. Davoli, and L. A. Giachini. Paralex: An environment for parallel programming in distributed systems. In Proceedings of the 6th International Conference on Supercomputing, pages 178–187. ACM Press, 1992.
[2] A. Beguelin, J. Dongarra, G. A. Geist, R. Manchek, K. Moore, P. Newton, and V. Sunderam. HeNCE: A Users' Guide Version 2.0, July 1994. Available online at: http://www.netlib.org/hence/HeNCE-2.0-doc.ps.gz.
[3] W. Cai, B. Lee, and J. Zhou. Causal order delivery in a multicast environment: An improved algorithm. Journal of Parallel and Distributed Computing, 62(1):111–131, Jan. 2002.
[4] C. D. Carothers, B. Topol, R. M. Fujimoto, J. T. Stasko, and V. Sunderam. Visualizing parallel simulations in network computing environments: A case study. In S. Andradottir, K. J. Healy, D. H. Withers, and B. L. Nelson, editors, Proceedings of the 1997 Winter Simulation Conference, pages 110–117, 1997.
[5] D. Chappell and J. H. Trimble, Jr. A Visual Introduction to SQL. John Wiley & Sons, Inc., second edition, 2002.
[6] I. F. Cruz. Doodle: A visual language for object-oriented databases. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, pages 71–80. ACM Press, June 1992.
[7] J. Dongarra, S. W. Otto, M. Snir, and D. Walker. A message passing standard for MPP and workstations. Communications of the ACM, 39(7):84–90, July 1996.
[8] C.-R. Dow, S.-K. Chang, and M. L. Soffa. A visualization system for parallelizing programs. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 194–203. IEEE Computer Society Press, Dec. 1992.
[9] T. Ebihara, R. Yoshioka, and N. N. Mirenkov. Program generation from film specifications. In Proceedings of the Sixth IASTED International Conference on Software Engineering and Applications, pages 403–410. Acta Press, Nov. 2002.
[10] M. Erwig. Xing: A visual XML query language. Journal of Visual Languages and Computing, 14(1):5–45, Feb. 2003.
[11] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine: A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.
[12] A. Guercio, T. Arndt, and S.-K. Chang. A visual editor for multimedia application development. In Proceedings of the 22nd International Conference on Distributed Computing Systems Workshops, pages 296–301. IEEE Computer Society, July 2002.
[13] W. D. Hillis and G. L. Steele, Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, Dec. 1986.
[14] C.-H. Hu and F.-J. Wang. Towards a practical visual object-oriented programming environment: Desirable functionalities and their implementation. Journal of Information Science and Engineering, 15(4):585–614, July 1999.
[15] P. Kacsuk, G. Dozsa, and T. Fadgyas. Designing parallel programs by the graphical language GRAPNEL. Microprocessing and Microprogramming, 41(8-9):625–643, Apr. 1996.
[16] A. D. Kshemkalyani and M. Singhal. Necessary and sufficient conditions on information for causal message ordering and their optimal implementation. Distributed Computing, 11(2):91–111, 1998.
[17] R. E. Ladner and M. J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831–838, Oct. 1980.
[18] Y.-C. Lin and C.-S. Yeh. Optimal parallel prefix on the postal model. Journal of Information Science and Engineering, 19(1):75–83, Jan. 2003.
[19] P. Maresca, T. Arndt, A. Guercio, and P. Donadio. An enhanced visual editor for multimedia software engineering. In Proceedings of the 8th International Conference on Distributed Multimedia Systems, pages 568–572. Knowledge Systems Institute, Sept. 2002.
[20] T. F. Masterson and R. M. Meyer. SIVIL: A true visual programming language for students. The Journal of Computing in Small Colleges, 16(4):74–86, May 2001.
[21] A. L. McKinney. A recent radical graphical approach to programming. The Journal of Computing in Small Colleges, 18(26):28–35, June 2003.
[22] N. Mirenkov, A. Vazhenin, R. Yoshioka, T. Ebihara, T. Hirotomi, and T. Mirenkova. Self-explanatory components: A new programming paradigm. International Journal of Software Engineering and Knowledge Engineering, 11(1):5–36, Feb. 2001.
[23] Y. Moses, Z. Polunsky, A. Tal, and L. Ulitsky. Algorithm visualization for distributed environments. In IEEE Symposium on Information Visualization, pages 71–78. IEEE Computer Society, Oct. 1998.
[24] MPI. MPI: A Message-Passing Interface Standard, June 1995. Available online at: http://www.mpi-forum.org/docs/mpi-11.ps.
[25] P. Newton and J. C. Browne. The CODE 2.0 graphical programming language. In Proceedings of the 6th International Conference on Supercomputing, pages 167–177. ACM Press, 1992.
[26] P. Newton and J. Dongarra. Overview of VPE: A visual environment for message-passing parallel programming. Technical report, University of Tennessee at Knoxville, Nov. 1994. Available online at: http://www.cs.utk.edu/~library/TechReports/1994/ut-cs-94-261.ps.Z.
[27] P. Newton and S. Y. Khedekar. CODE 2.0 User and Reference Manual, Mar. 1993. Available online at: http://www.cs.utexas.edu/users/code/download/CODE_2.0_Manual.ps.
[28] J. J. Pfeiffer, Jr., R. L. Vinyard, Jr., and B. Margolis. A common framework for input, processing, and output in a rule-based visual language. In Proceedings of the 2000 IEEE International Symposium on Visual Languages, pages 217–224. IEEE Computer Society Press, 2000.
[29] H. Ranriamparany, B. Ibrahim, and H. Liesenberg. Parallel application specification with the DIVA system. In ACIS First International Conference on Software Engineering Applied to Networking, and Parallel/Distributed Computing, pages 35–42, May 2000. Available online at: ftp://cui.unige.ch/VISUAL PROG/articles/snpd00.pdf.gz.
[30] R. R. Roxas and N. M. Mirenkov. A visual approach to specifying message-passing operations. In Proceedings of the International Conference on Parallel Processing Workshops. IEEE Computer Society, Oct. 2003. To appear.
[31] R. R. Roxas and N. M. Mirenkov. Visual input/output specification in different modes. In Proceedings of the International Conference on Advanced Information Networking and Applications, pages 185–188. IEEE Computer Society, Mar. 2003.
[32] R. R. Roxas and N. N. Mirenkov. Visualizing external inter-component interfaces. In Proceedings of the 22nd International Conference on Distributed Computing Systems Workshops, pages 290–295. IEEE Computer Society, July 2002.
[33] R. R. Roxas and N. N. Mirenkov. Visualizing input/output specification. In Proceedings of the 8th International Conference on Distributed Multimedia Systems, pages 660–663. Knowledge Systems Institute, Sept. 2002.
[34] S. R. Sarukkai and D. Gannon. Parallel program visualization using SIEVE.1. In Proceedings of the 6th International Conference on Supercomputing, pages 157–166. ACM Press, May 1992.
[35] D. B. Skillicorn and D. Talia. Models and languages for parallel computation. ACM Computing Surveys, 30(2):123–169, June 1998.
[36] N. Stankovic, D. Kranzlmuller, and K. Zhang. The PCG: An empirical study. Journal of Visual Languages and Computing, 12(2):203–216, Apr. 2001.
[37] N. Stankovic and K. Zhang. Visual programming for message-passing systems. International Journal of Software Engineering and Knowledge Engineering, 9(4):397–423, 1999.
[38] N. Stankovic and K. Zhang. A distributed parallel programming framework. IEEE Transactions on Software Engineering, 28(5):478–493, May 2002.
[39] J. Stasko, J. Dominique, M. H. Brown, and B. A. Price, editors. Software Visualization: Programming as a Multimedia Experience. The MIT Press, 1998.
[40] J. Tao, W. Karl, and M. Schulz. Memory access behavior analysis of NUMA-based shared memory programs. Scientific Programming, 10(1):45–53, 2002.
[41] A. P. Vazhenin, N. N. Mirenkov, and D. A. Vazhenin. Multimedia representation of matrix computations and data. Information Sciences, 141(1-2):97–122, Mar. 2002.
[42] L. Weaver and L. Robertson. Java Studio by Example. The Sun Microsystems Press, 1998.
[43] G. Wirtz. Modularization and process replication in a visual parallel programming language. In IEEE Symposium on Visual Languages, pages 72–79, Oct. 1994.
[44] R. Yoshioka and N. Mirenkov. Visual computing within environment of self-explanatory components. Soft Computing, 7(1):20–32, 2002.
[45] R. Yoshioka, N. Mirenkov, Y. Tsucida, and Y. Watanobe. Visual notation of film language system. In Proceedings of the 8th International Conference on Distributed Multimedia Systems, pages 648–655. Knowledge Systems Institute, Sept. 2002.
Session 5: System Architectures (2)
VLaTTe: A Java Just-in-Time Compiler for VLIW
with Fast Scheduling and Register Allocation∗

Suhyun Kim, Soo-Mook Moon                Kemal Ebcioglu, Erik Altman
[email protected]                        [email protected]
School of Electrical Engineering         IBM T. J. Watson Research Center
Seoul National University, Korea         Yorktown Heights, NY 10598
Abstract
For network computing on desktop machines, fast execution of Java bytecode programs is essential because these machines are expected to run substantial application programs written in Java. We believe higher Java performance can be achieved by exploiting instruction-level parallelism (ILP) in the context of Java JIT compilation. This paper introduces VLaTTe, a Java JIT compiler for VLIW machines that performs efficient scheduling while doing fast register allocation. It is an extended version of our previous JIT compiler for RISC machines called LaTTe [1], whose translation overhead is low (i.e., consistently taking one or two seconds for the SPECjvm98 benchmarks) due to its fast register allocation. VLaTTe adds the scheduling capability onto the same framework of register allocation, with a constraint for precise in-order exception handling which guarantees the same Java exception behavior as the original bytecode program. Our preliminary experimental results on the SPECjvm98 benchmarks show that VLaTTe achieves a geometric mean useful IPC of 1.7 (2-ALU), 2.1 (4-ALU), and 2.3 (8-ALU), while the scheduling/allocation overhead is 3.6 times longer than LaTTe's on average, which appears to be reasonable.

Keywords: Java virtual machine, Just-in-Time compilation, VLIW, scheduling, register allocation, out-of-order exception handling
∗This work has been supported by IBM T. J. Watson Research Center under a sponsored research agreement.
1 Introduction
Java has rapidly emerged as a commercially successful programming language that is being used
for everything from embedded systems to enterprise servers as well as for applets [2]. One of its
success factors comes from the platform independence achieved by using the Java virtual machine
(JVM), a program that executes Java’s compiled executable called bytecode. The bytecode is
a stack-based instruction set which can be executed by interpretation on all platforms without
porting the original Java source code. Since this software-based execution is obviously much
slower than hardware execution, performance issues have constantly been raised for Java.
We believe that higher Java performance can be achieved by instruction-level parallelism
(ILP) compiler and hardware technology. In particular, we are interested in exploiting ILP in the
context of Java just-in-time (JIT) compilation, targeting a VLIW machine. The JIT compiler
is a component of the Java Virtual Machine (JVM) which translates Java’s bytecode into native
machine code immediately prior to its execution so that the Java program runs as a real executable.
Since the machine code is “cached” in the JVM, it can be used again without re-translation. We
want to exploit ILP in Java programs by translating the bytecode into VLIW code.
In order to translate the bytecode into VLIW code, a straightforward approach is to first translate the bytecode into pseudo RISC code with symbolic registers, and then schedule and register-allocate the RISC code into VLIW code. The problem is how to generate high-performance VLIW code efficiently. Since the JIT compilation time is part of the whole running time, scheduling and register
allocation should be done quickly, preferably in a single integrated phase, i.e., software pipelining
followed by graph-coloring register allocation with coalescing would be the last techniques that
we want to use. This requires a trade-off between the quality and speed of scheduling and register
allocation, which poses a challenging engineering and research problem.
This paper introduces VLaTTe, a Java JIT compiler for VLIW machines that performs
fast scheduling with efficient register allocation. It extends our previous JIT
compiler for RISC machines with fast register allocation, LaTTe [1], with a
scheduling capability. The bytecode is first translated into pseudo RISC code in which many copies
corresponding to pushes and pops between local variables and the stack are generated. VLaTTe
performs global scheduling while performing register allocation that coalesces most of these copies
with a local lookahead. In fact, VLaTTe integrates scheduling and register allocation seamlessly in
a unified framework. Since the LaTTe JVM handles many of the Java runtime exceptions using traps
[1], VLaTTe schedules instructions with a constraint for precise out-of-order exception handling
in order to guarantee the same Java exception behavior as the original bytecode.
The rest of the paper is organized as follows. Section 2 briefly reviews the translation of the
bytecode into pseudo RISC code. Section 3 describes the fast scheduling and register allocation
technique of VLaTTe, which transforms the pseudo RISC code into VLIW code. Section 4 presents
experimental results to evaluate V LaTTe. A summary follows in Section 5.
2 Translation of Bytecode into Pseudo RISC Code
When a Java method is called for the first time, VLaTTe translates its bytecode into VLIW
code. The translation process is composed of five stages. In the first stage, all control join points
and subroutines in the bytecode are identified via a depth-first traversal. In the second stage,
the bytecode is translated into a control flow graph (CFG) of pseudo RISC instructions with
symbolic registers and all final/static/private methods are inlined. In the third stage, VLaTTe
performs scheduling and fast register allocation, generating VLIW code. Finally, the VLIW code
is converted into simulation code and is executed to obtain outputs and statistics. In this section,
we briefly review the translation of bytecode into pseudo RISC code.
2.1 Java Virtual Machine and Bytecode
The Java VM is a typed stack machine [3]. Each thread of execution has its own Java stack where
a new activation record is pushed when a method is invoked and is popped when it returns. An
activation record includes state information, local variables, and the operand stack. All compu-
tations are performed on the operand stack and temporary results are saved in local variables,
so there are many pushes and pops between the local variables and the operand stack. All data
manipulated in the JVM are typed, either by primitive types or by reference types of Java objects.
The bytecode instruction set includes instructions for stack manipulation, local variable access,
array access, and object access. It also includes arithmetic, logical, and conversion instructions
which take source operands off the stack and push the computation result back on the stack. For
flow control, there are instructions for method invocation, subroutines, branches, and returns.
Most instructions specify the type of data to manipulate in their opcodes.
2.2 Translation of Bytecode into Pseudo RISC Instructions
We briefly review the basic idea used when translating bytecode instructions into RISC primi-
tives with symbolic registers (we used SPARC primitives). The translation rule for each bytecode
instruction is determined based solely on the type and the opcode of the instruction itself. When
the code fragments generated for independently translated bytecode instructions are concate-
nated, the result is correct, since consistent formats are used for symbolic registers, especially for
those registers corresponding to the stack elements which are formed based on a translation-time
variable, TOP. A symbolic register in the pseudo RISC code is composed of three parts:
• The first character indicates the type: a = address (object reference), i = integer, f = float,
l = long, and d = double
• The second character indicates the location: s = operand stack, l = local variable,
t = temporaries generated for JIT translation purposes
• The remaining number further distinguishes the symbolic register
For example, al0 represents a local variable indexed by 0 whose type is an object reference. is2
represents the second item of the operand stack whose type is an integer.
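As an illustration, the three-part naming scheme can be decoded mechanically. The helper below is hypothetical (it only mirrors the convention described above, not VLaTTe's internal representation):

```python
# Decode a symbolic register name per the convention above:
# type character, location character, then a distinguishing index.
TYPES = {'a': 'address', 'i': 'integer', 'f': 'float', 'l': 'long', 'd': 'double'}
LOCS = {'s': 'operand stack', 'l': 'local variable', 't': 'temporary'}

def parse_symbolic(name):
    """Split a symbolic register name into (type, location, index)."""
    ty, loc, idx = name[0], name[1], int(name[2:])
    assert ty in TYPES and loc in LOCS, f'bad register name: {name}'
    return (TYPES[ty], LOCS[loc], idx)

print(parse_symbolic('al0'))   # ('address', 'local variable', 0)
print(parse_symbolic('is2'))   # ('integer', 'operand stack', 2)
```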
TOP is a translation-time variable used by (V)LaTTe (not a value computed at run-time) which
indicates the number of items on the operand stack just before translating the current bytecode
instruction. For example, if the current value of TOP is 4, “add isTOP-1, isTOP, isTOP-1”
means “add is3, is4, is3”. There is another translation-time array, type[1..TOP] which indicates
the type of each item (one of a,i,f,l,d) currently on the stack (required for translating dup/pop).
The bytecode of a method is traversed in depth-first order, starting at the beginning of the
method with TOP set to zero. Following any path of the bytecode, when a bytecode instruction
that pushes some item(s) on the stack is encountered, TOP is incremented by the number of
pushed items. Similarly, when a bytecode instruction that pops some item(s) is encountered, TOP
is decremented by the number of popped items. The type array type[] is appropriately updated
by the type of pushed items. For example, bytecode instructions that push a local variable onto
the operand stack or pop the stack top into a local variable are translated into symbolic register
copies as follows ($ means a translation-time action).
iload n               // stack: ... => ..., (value of local variable n)
    mov iln, isTOP+1  // means a copy "isTOP+1 = iln"
    $TOP = TOP+1      // the stack now has one more item.
    $type[TOP] = 'I'  // the type of the new item is integer.
astore n              // stack: ..., (object reference) => ...
    mov asTOP, aln
    $TOP = TOP-1      // the stack now has one less item.
It should be noted that these symbolic register copy instructions do not really generate code
because they will be coalesced away during the scheduling/allocation phase.
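To make the translation-time bookkeeping concrete, here is a toy Python sketch (not VLaTTe's actual code) that applies the iload/istore/iadd rules above while tracking the translation-time TOP counter:

```python
# Toy per-instruction translator: each bytecode maps to pseudo RISC text
# based solely on its opcode and the current translation-time TOP.
def translate(bytecodes):
    """Translate (opcode, operand...) tuples into pseudo RISC instructions."""
    top, out = 0, []
    for op, *arg in bytecodes:
        if op == 'iload':        # push local variable n onto the stack
            top += 1
            out.append(f'mov il{arg[0]}, is{top}')
        elif op == 'istore':     # pop the stack top into local variable n
            out.append(f'mov is{top}, il{arg[0]}')
            top -= 1
        elif op == 'iadd':       # pop two items, push their sum
            out.append(f'add is{top-1}, is{top}, is{top-1}')
            top -= 1
    return out

# "b = b + c" compiles to: iload_1; iload_2; iadd; istore_1
print(translate([('iload', 1), ('iload', 2), ('iadd',), ('istore', 1)]))
```

Note that the two mov instructions produced here are exactly the copies that the later scheduling/allocation phase coalesces away.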
The arithmetic/logical/shift bytecode instructions that operate on the top items of the operand
stack can be directly mapped to one or two pseudo instructions.
iadd                                 // stack: ..., x, y => ..., (x+y)
    add isTOP-1, isTOP, isTOP-1      // means "isTOP-1 = isTOP-1 + isTOP"
    $TOP = TOP-1                     // the stack now has one less item.
For instance variable access, it is done by a single memory access since instance variables are
included in the object (along with a pointer to the method table and a 32-bit lock). Here is an
example pseudo code for reading an integer instance variable foo from an object x.
getfield <x.foo>                     // stack: ..., (object ref) => ..., (integer)
    ld [asTOP + foo_offset], isTOP   // "isTOP = load @[asTOP + foo_offset]"
    $type[TOP] = 'I'                 // foo_offset is a constant
The JVM is required to throw a NullPointerException if the object reference is NULL. LaTTe
does not generate such check code here because if the object reference is NULL, a SIGSEGV or
SIGBUS signal will be raised by the OS during the execution of the load/store; the LaTTe JVM
includes a signal handler where the NullPointerException is thrown.
Other bytecode instructions require more elaborate translation, considering exception han-
dling, object layout, and the calling conventions. More details can be found in [1].
2.3 A Translation Example
Figure 1 shows a simple translation example from bytecode into pseudo RISC code.
[Figure 1 shows four panels: (a) the Java source

     a = a + c;
     if (b > c) b = b - c; else b = b + c;
     tmp = a; a = b; b = tmp;

(b) its bytecode (offsets 17-43), (c) the pseudo RISC code produced by the
per-instruction translation rules, and (d) the CFG of the pseudo code,
partitioned into tree Regions A and B at the join point after the conditional
branch, where a, b, c are live and tmp is dead.]
Figure 1: A translation example from bytecode into pseudo code; last uses are marked by ’.
3 Fast Scheduling and Register Allocation
The translation rules described above indicate that it is straightforward to convert the bytecode
into RISC code with symbolic registers. We now describe a fast, global scheduler with local
register allocation which schedules operations aggressively while coalescing copies efficiently, thus
conserving registers. The technique is based on list scheduling integrated with the left-edge greedy
interval coloring algorithm [4], extended to a larger region of code called the tree region.
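For background, the left-edge greedy interval coloring that the technique builds on can be sketched as follows for straight-line live ranges (illustrative code under assumed inputs, not VLaTTe's implementation):

```python
# Left-edge greedy interval coloring: sort live ranges by start point and
# reuse the lowest-numbered register whose previous interval has ended.
import heapq

def left_edge_alloc(intervals):
    """intervals: name -> (start, end); returns name -> register index."""
    free, active, alloc, next_reg = [], [], {}, 0
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        while active and active[0][0] < start:   # expire finished intervals
            _, reg = heapq.heappop(active)
            heapq.heappush(free, reg)
        reg = heapq.heappop(free) if free else next_reg
        if reg == next_reg:                      # had to open a new register
            next_reg += 1
        alloc[name] = reg
        heapq.heappush(active, (end, reg))
    return alloc

# x and z do not overlap, so they share a register; y needs its own.
print(left_edge_alloc({'x': (0, 3), 'y': (1, 2), 'z': (4, 5)}))
```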
3.1 Tree Regions
The CFG of pseudo RISC code is partitioned into tree regions which are single-entry, multiple-exit
subgraphs shaped like trees (same as extended basic blocks). Tree regions start at the beginning
of the program or at control join points and end at the end of the program or at other join points.
For example, the following CFG, composed of three basic blocks (A,B,C), includes two tree regions:
Region (1) starts at label A: and ends at the flow graph edge corresponding to goto B, while
A: ...
   goto B                     // Region (1) = { A }
B: ...
   if cc goto B else goto C   // Region (2) = { B, C }
C: ...
   return
region (2) starts at label B: and ends at the flow graph edges corresponding to goto B and return.
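The partitioning itself can be sketched in a few lines of Python (an illustrative re-implementation of the rule above, not VLaTTe's code): region headers are the entry block and every join point, and each remaining block joins the region of its unique predecessor.

```python
# Partition a CFG into tree regions (extended basic blocks).
def tree_regions(cfg, entry):
    """cfg: block -> list of successor blocks; returns header -> member blocks."""
    preds = {n: 0 for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s] += 1
    # Headers: the entry block, plus any control join point.
    headers = {n for n in cfg if n == entry or preds[n] >= 2}
    regions = {h: [h] for h in headers}

    def header_of(n):
        while n not in headers:
            n = next(p for p, ss in cfg.items() if n in ss)  # unique predecessor
        return n

    for n in cfg:
        if n not in headers:
            regions[header_of(n)].append(n)
    return regions

# The three-block example above: A -> B; B -> B, C; C returns.
print(tree_regions({'A': ['B'], 'B': ['B', 'C'], 'C': []}, 'A'))
```

On the example this yields region (1) = {A} and region (2) = {B, C}, since B is a join point (reached from A and from its own back edge).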
A tree region is the unit of optimizations in V LaTTe. Tree regions can be enlarged by code
duplication techniques such as loop unrolling to increase the opportunity of optimizations in
frequently executed parts of code (we have not exploited this yet). By working on tree regions,
VLaTTe trades off between the quality and speed of scheduling and register allocation.
After regions are constructed for a method, the last uses of symbolic registers are computed.
Stack and temporary symbolic registers are supposed to be dead once they are used in a
translated instruction, since their live ranges cannot span beyond the translated code sequence
for a single bytecode; consequently, the last uses of these symbolic registers can be readily
identified. For local
symbolic registers, however, liveness is computed “approximately” via a single post-order traversal
of the regions such that every local symbolic register is assumed to be live on a backward edge of
the CFG. This is a conservative (upper bound) yet fast estimation of live variables at each region.
Based on this information, we can identify the last use of each local symbolic register (marked by
’ in Figure 1). When a local symbolic register becomes dead on one path of a conditional branch
while it is live on the other path, we mark the last use of it on the path where it becomes dead.
The regions are then scheduled with register allocation one by one in reverse post-order traversal
of the CFG of regions. Within each region, the operations of the tree itself are traversed
twice, first in a post-order traversal called the backward sweep, followed by a topologically-
sorted order of traversal called the forward sweep. The backward sweep collects information on
the preferred destination registers for instructions (which works as a local lookahead) while the
forward sweep performs actual scheduling and register allocation using that information, creating
VLIWs. During each traversal, a map which is a set of (symbolic, real) register pairs is collected
and propagated following the traversal direction. The map is called p map in the backward sweep
which describes preferred assignment for destination registers, and h map in the forward sweep
which describes the current register allocation result.
3.2 Backward Sweep
The backward sweep() algorithm in Figure 2 is called with the root of the region as an argument.
The purpose of the backward sweep is to compute the p map for destinations of instructions, based
on the required register assignment at the end of the region (if any) or at method calls/returns
according to the calling convention. For example, if a symbolic register il2 is to be allocated by
a real register r at the end of a region (e.g., due to the prior register allocation in other regions
that reach the same region boundary) and we have the operation “add is1, is2, is1” followed by
“mov is1, il2” just before the end of the region, then the destination is1 of the add is preferred
to be allocated to r to avoid a copy (saved at the preferred assignment field of instructions).
backward_sweep(instr) returns p_map   // p_map: set of (symbolic, real) register pairs
  if (instr is the header of a next region)            // reached a region boundary
    p = NULL;
    if (there is an old h_map, h_old, saved at instr)  // the region boundary has been visited
      p = h_old;                                       // previously via a prior forward sweep
    return p;                                          // in other regions

  p = backward_sweep(NextInstr(instr));
  if (instr has more than one successor)
    for (each successor s of instr other than NextInstr(instr))
      p = merge_map(p, backward_sweep(s));

  // Now, compute the p_map above instr
  if (instr has a target symbolic register x AND there is a preferred assignment (x,r) in p)
    preferred_assignment(instr) = r;
    if (instr is a copy x=y)
      insert (y,r) into p;
    delete (x,r) from p;        // (x,r) is not propagated above instr
  else if (instr is a call)
    insert (argument symbolic register, out register) pairs into p;
    delete (returned value symbolic register, *) from p;
  else if (instr is a return)
    insert (is0, %i0) into p;   // add (return value symbolic register, return register) to p

  return p;
Figure 2: The backward sweep algorithm.
If the preferred map for x is r under a copy “mov y, x”, then the preferred map for y becomes
r above the copy. At a conditional branch, the preferred map of both paths are unioned, yet if
there are two different preferred assignments for a symbolic register, the one coming from the
more likely path is taken (when there is no profiling done, an arbitrary one is taken).
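The copy rule can be illustrated in isolation (a toy sketch with hypothetical register names, not the paper's code): above a copy, the destination's preferred real register transfers to the source.

```python
# Above "mov src, dst", dst's preferred real register becomes src's
# preference, and (dst, r) is no longer propagated upward.
def propagate_copy(p_map, dst, src):
    """Return the p_map holding just above the copy dst = src."""
    p = dict(p_map)
    if dst in p:
        p[src] = p.pop(dst)
    return p

# il2 must end up in r5 at the region end; above "mov is1, il2" the
# preference moves to is1, so the instruction producing is1 targets r5.
print(propagate_copy({'il2': 'r5'}, dst='il2', src='is1'))  # {'is1': 'r5'}
```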
3.3 Forward Sweep
After the preferred assignments for instructions are computed, a forward sweep is performed to
schedule the region while allocating real registers. The earliest scheduling time of each operation
is determined using an array called avail[] while register allocation is performed using the h map.
The right-hand-side of each operation is readily allocated with h map, and searching for the target
register occurs concurrently with finding the VLIW where the operation will be scheduled.
The schedule() algorithm in Figure 3 is called at the root of the region with h, an h map that
is saved at the root. From h, we can determine refcount that shows how many symbolic registers
are mapped to each real register and freereg which indicates the set of real registers to which
no symbolic registers map. For the first region at the beginning of a method, h is initialized by
the map of parameters according to the calling convention. As regions are allocated in reverse
post-order, h at the end of a region is propagated to the beginning of the next region.
When scheduling a tree region, we actually create a tree of tree VLIWs [5] during the forward
sweep. That is, we visit each instruction in topologically sorted order (the exact order will be
determined by priorities of the paths in the region which will be described shortly) and schedule it
in the earliest possible VLIW with its source and target operands being allocated to real registers.
Let us first describe the register allocation part of the forward sweep. When an instruction
z=x+y is encountered, the registers are allocated as follows. First, the right-hand-side (RHS) is
allocated simply as h[x]+h[y] no matter where it is scheduled. If x is the last use, the refcount of
the real register h[x] is decremented by one, and h[x] is removed from the liveRegs if the refcount
becomes zero, and (x, h[x]) is deleted from h (meaning that x is now dead). We do the same for y.
For the target operand z, if the instruction is a copy z=x and x was mapped to a real register
r, then we just update h by allocating z to r (h[z]=r), meaning that the copy is coalesced and
does not generate any code. Otherwise, we have to allocate a real register for z, but we need to
decide first where to schedule the instruction before we choose the real register (unlike its source
schedule( root_instr, given_h )
  init_path = create_path( continuation=root_instr, prob=1.0, pathLength=1,
                           avail="all 0s", h=given_h, refcount,
                           liveRegs (computed from h) );
  add init_path to pathlist;

  while( pathlist != NULL )
    p = First item in pathlist;          // most probable path
    INSTR = p.continuation;

    if( IsRegionEnd(INSTR) )
      // Generate copy ops at the end of a region to match register mappings
      if( INSTR does not have h_old assigned yet ) h_old(INSTR) = p.h;
      else GenerateCopyOps( p.h, h_old(INSTR) );
      remove p from pathlist;
      continue;

    schedule_an_instr( INSTR, p );

    // update p & pathlist
    if( INSTR has multiple successors )
      remove p from pathlist;
      for( Each successor s of INSTR )
        pathlist = insert_path_in_prob_order( clone_path(p,s), pathlist );
        // clone_path(p,s) makes a copy of path p continuing at s. Also:
        // 1- probability of the new path is set to orig. probability * prob(p,s)
        // 2- h, refcount, liveRegs of the path are updated
        //    based on symbolic registers dead at s
    else
      p.continuation = next_instr( INSTR );   // just continue with the next op
Figure 3: The forward sweep algorithm.
operands). When the VLIW to place the instruction is determined, we check if there is a preferred
assignment for it (a real register that z will eventually be mapped into) and if it is dead at that
VLIW. If so, we choose the register. Otherwise, we choose any dead register at that VLIW (if
every register is live, we need to perform a spill). Now, the pair (z, the chosen register) is inserted
into h and the updated h will be used for register allocation of the next instruction.
Starting from the root of a region, all instructions are register-allocated as described above
while being scheduled. When the header (root) of the next region is encountered, we save the
current h map at the root so that the forward sweep at the next region can start with an initial
h map. Since the header is a join point, more than one forward sweep reaches the same header.
If no h map is saved at the header when the current forward sweep reaches it, we simply save the
current h map there. Otherwise, we need to reconcile the current and the old h map by inserting
some copies if needed. The reconciling problem, in fact, is similar to replacing SSA φ nodes by a
set of equivalent move operations [6] and we can use the same solution to minimize those copies1.
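A minimal sketch of that reconciliation, assuming a dedicated scratch register named tmp (the GenerateCopyOps details are not given in the paper): the required moves form a parallel copy, which is sequentialized like SSA φ elimination, breaking register cycles through the scratch register.

```python
# Reconcile the current h_map with the h_map saved at a join point by
# emitting "mov src, dst" copies (destination last, as in the paper).
def reconcile(h_cur, h_old):
    """Return copies that move each value from h_cur[x] to h_old[x]."""
    moves = {h_old[x]: h_cur[x] for x in h_old if h_cur[x] != h_old[x]}
    out = []
    while moves:
        # A destination is safe to write once no pending move reads it.
        ready = [d for d in moves if d not in moves.values()]
        if ready:
            d = ready[0]
            out.append(f'mov {moves.pop(d)}, {d}')
        else:
            # Only cycles remain: park one destination in the scratch register.
            d = next(iter(moves))
            out.append(f'mov {d}, tmp')
            moves = {dst: ('tmp' if src == d else src) for dst, src in moves.items()}
    return out

# A swap: x currently in r0 but recorded in r1, and vice versa for y.
print(reconcile({'x': 'r0', 'y': 'r1'}, {'x': 'r1', 'y': 'r0'}))
```

As the footnote notes, these copies can themselves be scheduled into VLIWs where their sources are available and their targets are free.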
schedule_an_instr( INSTR, p )
  Parse INSTR as "z=x+y";
  real_rhs = "p.h[x]+p.h[y]";
  OldLiveRegs = p.liveRegs;     // checkpoint liveRegs to detect change

  // check last uses
  if( x is a last use )
    if( --p.refcount[p.h[x]] == 0 ) p.liveRegs -= p.h[x];
    p.h[x] = NULL;
  if( There is a second operand y != x  &&  y is a last use )
    if( --p.refcount[p.h[y]] == 0 ) p.liveRegs -= p.h[y];
    p.h[y] = NULL;

  // Copy operations do not generate code, just change the h map
  if( INSTR is a copy z=x )
    p.h[z] = real_rhs;          // i.e. the old p.h[x]
    p.refcount[p.h[z]]++;
    p.liveRegs += p.h[z];
    return;

  earliest, vliw_list = get_sched_time( INSTR, p, OldLiveRegs );
  insert_ins_in_sched( p, INSTR, vliw_list, earliest, real_rhs );
Next we describe the scheduling part of the forward sweep. The algorithm schedule() maintains
a priority queue called pathlist for currently open paths in the tree region where there is scheduling
going on. Instructions are visited and scheduled at the end of the currently most probable path
(where probabilities are obtained through profiling, or just guessed). Initially, the algorithm starts
with a single path. As a conditional branch is scheduled, the current path is cloned for the two
paths and are inserted into the pathlist in order of probability (our experiments do not use it).
For a given operation z=x+y encountered during the forward sweep, we allocate its RHS as h[x]
+ h[y] as described previously. Then we find the first VLIW on the current path, starting from
the root VLIW of the region, where x is ready and the first VLIW where y is ready, and take
the later of the two (each path maintains an array avail that maps each physical register
1These copies can also be scheduled in VLIWs where source registers are available and target registers are free.
to the first VLIW on this path where the register is ready). Let us call this VLIW the earliest
VLIW. The operation will be scheduled in one of VLIWs between the earliest VLIW and the last
VLIW on a given path.
get_sched_time( INSTR, p, OldLiveRegs )
  // Make an array of VLIWs from the root to the end of the path
  vliw_list = make_vliw_array( p.curr_vliw );

  // Get the earliest VLIW from the data dependences of the source operands
  earliest = MAX( p.avail[ h[INSTR's each source] ] );

  // Check if allowed to schedule earlier than the last VLIW
  if( INSTR has multiple branch targets || INSTR is a store )
    earliest = MAX( earliest, p.pathLength - 1 );

  if( OldLiveRegs != p.liveRegs )
    // Live real regs changed ==> recompute live variables
    vliw_list[p.pathLength].liveRegs = p.liveRegs;
    for( time = p.pathLength-1; time > earliest; time-- )
      l = compute_live( vliw_list[time] );
      if( vliw_list[time].liveRegs == l ) break;    // stop when no more changes
      vliw_list[time].liveRegs = l;

  // Compute the real regs that are free until the end
  free_till_end = all 1s;
  for( time = p.pathLength; time > earliest; time-- )
    free_till_end &= ~vliw_list[time].liveRegs;
    vliw_list[time].free_until_end = free_till_end;

  return earliest, vliw_list;
Before we schedule the operation at some VLIW, we need to decide the real register for its
target operand, which should be dead at that VLIW. Each VLIW maintains a set of live registers
to help the decision. The problem is that some of these currently live registers might become
dead when we move up an operation whose RHS allocation results in newly dead registers (when
the RHS includes last uses); the newly dead register can also be used as a target register for the
operation being scheduled. In order to obtain this information locally even before we actually
move up the operation, the scheduling proceeds as follows. If any register has gone dead when
allocating the RHS of the current operation, the algorithm recalculates the live real registers of
the preceding VLIW on the current path (using new dead registers), then does the same for the
VLIW just before that, until a VLIW is encountered whose live registers do not change, or the
earliest VLIW is encountered. Then the algorithm walks forward on that path from the earliest
VLIW until it finds a VLIW where there is a functional unit available and a dead register r that
can be allocated to the target z and is not live until the end of the current path (i.e., a dead
register at that VLIW is not necessarily dead until the end of the path). The operation r = h[x]
+ h[y] is inserted at that VLIW.
Then we do the final cleanup: h[z] is updated to r, and r is made live between the
VLIW where the operation is scheduled and the last VLIW on the current path. Also, the real
registers of the RHS are made live between the earliest VLIW and the VLIW where the operation
is actually scheduled.
The place where z is ready (avail[h[z]]) is set to the VLIW that is away by the latency of the
operation from the VLIW where it is scheduled. Stores, traps, and branches are always scheduled
in the last VLIW on a given path.
3.4 Scheduling Constraints for Java Exceptions
VLaTTe catches Java runtime exceptions such as NullPointerException using traps (exceptions).
This imposes a constraint on the scheduling algorithm: the behavior of Java exceptions in the
program must not be affected. For potentially exceptional instructions such as loads, this means
they could neither be scheduled speculatively above a branch nor be reordered, which would
severely constrain the scheduling. In order to achieve high ILP while guaranteeing in-order
checking of exceptional conditions, we have provided out-of-order execution support via exception
tags added to each register, an out-of-order version of each opcode that may raise an exception, and a
commit operation that checks the exception tags. We have also enhanced the scheduling algorithm
for this hardware support (more precise than in [7, 8]). This topic is beyond the scope of this
paper and will be discussed elsewhere. Currently, VLaTTe guarantees identical Java exception
behavior to that of the original bytecode program.
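The intended semantics of the tag-and-commit scheme can be modeled as follows (an illustrative sketch of the behavior described above; the actual hardware interface is deferred to the companion work): a speculatively executed load records a pending exception in its destination register's tag rather than trapping, and the in-order commit raises it only when control actually reaches the instruction's original position.

```python
# Model of out-of-order exception checking with per-register tags.
class TaggedReg:
    def __init__(self, value=0, tag=None):
        self.value, self.tag = value, tag

def load_ooo(addr_reg, heap):
    """Speculative load: defer a null-pointer fault into the result's tag."""
    if addr_reg.value is None:
        return TaggedReg(0, tag='NullPointerException')
    return TaggedReg(heap[addr_reg.value])

def commit(reg):
    """In-order commit: surface the deferred exception, if any."""
    if reg.tag is not None:
        raise RuntimeError(reg.tag)
    return reg.value

r = load_ooo(TaggedReg(None), {})      # speculation past a null check
try:
    commit(r)                          # exception surfaces only here
except RuntimeError as e:
    print('exception raised at commit:', e)
```

If the branch guarding the load takes the other path, the tagged result is simply never committed and the spurious exception vanishes, which is what makes the speculative scheduling safe.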
3.5 A Scheduling and Register Allocation Example
Figure 4 describes the scheduling and register allocation process in detail for the example in
Figure 1. The final VLIW schedule is shown in Figure 5. The example shows that the original
insert_ins_in_sched( pathlist, p, INSTR, vliw_list, earliest, real_rhs )
  // Go from earliest to latest to find the first VLIW where the operation fits
  real_lhs = NULL;
  for( time = earliest; time < p.pathLength; time++ )
    vliw = vliw_list[time];
    nextvliw = vliw_list[time+1];

    if( vliw has no resource for INSTR ) continue;
    if( INSTR has a dest reg z )
      if( INSTR has a preferred assignment r for its destination z
          AND r is in nextvliw.free_until_end )
        real_lhs = r;
        break;
      else if( nextvliw.free_until_end != NULL )
        real_lhs = Smallest free real register in nextvliw.free_until_end;
        break;
      else continue;    // resources OK, but dest reg not available
    else break;         // resources OK, and dest reg not needed

  // Append a new VLIW if the operation did not fit anywhere
  if( time == p.pathLength )
    append_list( vliw_list, create_new_vliw( p.pathLength++ ) );

  schedule "real_lhs = real_rhs" in vliw_list[time];
  for( each register s in real_rhs )
    for( Each VLIW v, avail[s] < v <= time )
      Add s to v.liveRegs;
  for( Each VLIW v, time+1 <= v <= lastvliw+1 )
    Add the dest reg real_lhs (if present) to v.liveRegs;

  p.h[z] = real_lhs;
  p.refcount[p.h[z]]++;
  p.liveRegs += p.h[z];
  p.avail[p.h[z]] = time + op_latency;
bytecode can be executed by only two VLIWs and most of the copies in the initial pseudo code
are coalesced away. The remaining copies are necessary since there is swapping of two variables.
4 Preliminary Experimental Results
We have implemented VLaTTe on the SPARC platform. The current version used for the experiments
has not been fully optimized and tuned. In particular, it does not employ any object-oriented
(OO) optimization techniques such as customization or specialization combined with inlining,
which will be useful for increasing the ILP opportunity (currently, the number of instructions in
a tree region is too small due to the OO feature of small method size). Also, the JIT translation
Pseudo Code             Scheduling Actions
======================= =============================
Start:
    h[v]=r means symbolic register v is mapped to hardware register r.
    v' indicates the last use of v.
    Assume the initial mapping h[il0]=r0, h[il1]=r1, h[il2]=r2.

17: mov il0',is5        h[is5]=h[il0]=r0
18: mov il2,is6         h[is6]=h[il2]=r2
19: add is5',is6',is5   h[is5]=r0, h[is6]=r2 are ready in VLIW 1.
                        is5 is last use of r0, free h[is5]=r0.
                        Assign r0 (first free reg in VLIW 1) to dest is5,
                        new h[is5]=r0.
                        ==> Schedule add r0,r2,r0 in VLIW 1

20: mov is5',il0        h[il0]=h[is5]=r0
21: mov il1,is5         h[is5]=h[il1]=r1
22: mov il2,is6         h[is6]=h[il2]=r2
23: sle is5',is6,ctp17  h[is5]=r1, h[is6]=r2 are ready in VLIW 1.
                        Assign cc0 to ctp17, new h[ctp17]=cc0.
                        ==> Schedule sle r1,r2,cc0 in VLIW 1
23: jset ctp17', 33     h[ctp17]=cc0 is ready in VLIW 2.
                        ==> Schedule jset cc0, 33 in VLIW 2
                        Make two paths
                        (let us assume the false path has high priority)

26: mov il1',is5        h[is5]=h[il1]=r1
27: mov il2,is6         h[is6]=h[il2]=r2
28: sub is5',is6',is5   h[is5]=r1, h[is6]=r2 are ready in VLIW 1.
                        is5 is last use of r1, free h[is5]=r1.
                        Assign r3 (first free reg in VLIW 1; r1 is free
                        only in the false path of VLIW 2) to dest is5,
                        new h[is5]=r3.
                        ==> Schedule sub r1,r2,r3 in VLIW 1
29: mov is5',il1        h[il1]=h[is5]=r3
30: (goto 43)           Remove the false path and save h at 43.

33: mov il1',is5        h[is5]=h[il1]=r1
34: mov il2,is6         h[is6]=h[il2]=r2
35: add is5',is6',is5   h[is5]=r1, h[is6]=r2 are ready in VLIW 1.
                        is5 is last use of r1, free h[is5]=r1.
                        Assign r1 (now free in VLIW 1 also) to dest is5,
                        new h[is5]=r1.
                        ==> Schedule add r1,r2,r1 in VLIW 1
36: mov is5',il1        h[il1]=h[is5]=r1
37: mov il0',is5        h[is5]=h[il0]=r0
38: mov is5',il3        h[il3]=h[is5]=r0
39: mov il1',is5        h[is5]=h[il1]=r1
40: mov is5',il0        h[il0]=h[is5]=r1
41: mov il3,is5         h[is5]=h[il3]=r0
42: mov is5',il1        h[il1]=h[is5]=r0
                        We should reconcile the h map (see Figure 5).
43: ...
Figure 4: The scheduling/allocation process for the example.
[Figure 5 comprises three panels: (a) Conflicts at join points, (b) Reconcile register
allocation, (c) Final Schedule. In each panel the schedule is:
    VLIW 1:  19: add r0,r2,r0   23: sle r1,r2,cc0   28: sub r1,r2,r3   35: add r1,r2,r1
    VLIW 2:  23: jset cc0, 33   (T/F branch)
At the join point 43, the code expects to use il0(r0), il1(r3), il2(r2). The false path
arrives with h[il0]=r0, h[il1]=r3, h[il2]=r2, but the true path arrives with h[il0]=r1,
h[il1]=r0, h[il2]=r2: a conflict. Reconciliation inserts mov r0,r3 and mov r1,r0 on the
true path, after which both paths agree on h[il0]=r0, h[il1]=r3, h[il2]=r2
(with avail[r0]=1, avail[r1]=1).]
Figure 5: The final VLIW schedule.
speed has not been fully optimized. Nonetheless, we report some preliminary experimental results
with the SPECJVM98 benchmark suite to evaluate the quality and speed of VLaTTe.
4.1 Experimental Environment
Our Java benchmarks comprise 8 programs from the SPECJVM98 suite, described in Table 1.
Our test machine is a SUN Ultra5 (270 MHz) with 256 MB of memory running Solaris 2.6,
tested in single-user mode.
Benchmark       Description          Bytes
200 check       JVM checker          27606
201 compress    Compress Utility     24326
202 jess        Expert Shell System  45392
209 db          Database             26425
213 javac       Java Compiler        92000
222 mpegaudio   MP3 Decompressor     38930
227 mtrt        Raytracer            34009
228 jack        Parser Generator     51380
Table 1: Java Benchmark Description and Translated Amount of Bytes.
The target VLIW machines are based on parametric resource constraints of n ALUs and n-way
branching, and we have evaluated three machines, with n = 2, 4, and 8. Each ALU can perform
either an arithmetic operation or a memory operation in one cycle, while branches are handled by
a dedicated branch unit. All these machines are assumed to have 96 integer registers (32 SPARC
+ 64 scratch for renaming), 48 floating point registers, 8 condition registers, and perfect caches.
Since our implementation runs on the SPARC machine, a set of SPARC simulation instructions
(in direct binary form) is generated for each VLIW. These SPARC instructions emulate the
actions of each VLIW and update statistics. In fact, we use a compiled simulation method similar
to Shade [9] for simulating our VLIW machine on the SPARC.
VLaTTe's JIT translation is closely coordinated with the LaTTe JVM's on-demand exception
handling which involves a collaboration between the translated code, the JIT compiler, and the
JVM for streamlining the translation process across exception handlers and finally blocks.
LaTTe JVM is also equipped with a lightweight monitor [10] and a fast mark-and-sweep garbage
collector.
4.2 Useful IPC Results
Figure 6 depicts the useful IPC of each benchmark on the three VLIW machines. The useful IPC
is calculated as the number of SPARC instructions in the sequential execution trace (where
each VLIW contains a single SPARC instruction) divided by the number of executed VLIW instructions.
Figure 6: Useful IPCs.
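As a concrete illustration of this formula (the instruction and VLIW counts below are made up, not taken from the experiments):

```python
def useful_ipc(seq_sparc_instrs, executed_vliws):
    """Useful IPC = sequential-trace instruction count / executed VLIWs."""
    return seq_sparc_instrs / executed_vliws

# 10,500 SPARC instructions packed into 5,000 executed VLIWs:
print(useful_ipc(10_500, 5_000))  # 2.1
```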
Considering the non-numerical and object-oriented nature of many of our benchmarks, the
geometric means of useful IPC, 1.7 (2-ALU) and 2.1 (4-ALU), appear competitive, while that
of the 8-ALU machine is somewhat lower. This is mainly due to small tree regions, especially in
213 javac, 227 mtrt, and 228 jack, where many OO features limit the ILP opportunity. We
expect better IPC from the OO optimizations that are under development.
4.3 Scheduling and Register Allocation Overhead
In order to evaluate the scheduling/register allocation overhead of VLaTTe, we measured the
running time of phase 3 and compared it with the running time of phase 3 of LaTTe, which
performs fast register allocation only (the results are shown in Table 2). LaTTe's JIT
compilation time mostly comes from register allocation overhead and has been shown to be tiny
compared to the whole running time of the programs (consistently 1-2 seconds for SPECJVM98
runs of 40-80 seconds) [1]. In this way, we can evaluate the translation overhead of VLaTTe indirectly.
Benchmark        Register Allocation   R.A. + Scheduling   Ratio
                 Time (LaTTe)          Time (VLaTTe)       (VLaTTe/LaTTe)
200 check        0.87                  3.26                3.75
201 compress     0.78                  2.70                3.46
202 jess         1.37                  4.50                3.28
209 db           0.82                  2.89                3.52
213 javac        2.66                  9.29                3.49
222 mpegaudio    1.29                  5.28                4.09
227 mtrt         1.02                  3.98                3.90
228 jack         1.94                  6.15                3.17
geomean                                                    3.57
Table 2: Scheduling/Register Allocation Time (seconds).
On average, the unified scheduling/allocation time of VLaTTe is about 3.6 times longer than
the running time of LaTTe's fast register allocator. Since VLaTTe will eventually run on a
faster VLIW machine, its translation time appears reasonable. The translation time can be
reduced with further tuning and with adaptive compilation, which is also under development.
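The 3.6x figure is the geometric mean of the per-benchmark ratios reported in Table 2, and can be reproduced directly:

```python
import math

# Per-benchmark ratios from Table 2 (VLaTTe time / LaTTe time).
ratios = [3.75, 3.46, 3.28, 3.52, 3.49, 4.09, 3.90, 3.17]
geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(round(geomean, 2))  # 3.57, matching the geomean row of Table 2
```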
Since the current implementation is a research prototype, the absolute running time might
be less meaningful. We examined how the scheduling/allocation time is affected by the size of a
method by measuring the scheduling/allocation time of each method separately. Figure 7 depicts
the plot of (number of instructions in a method, scheduling/allocation time), on a logarithmic
scale, for all methods in the SPECJVM98 benchmarks. The figure shows that the
scheduling/allocation overhead is linear in the number of instructions in the method, with a
slope of 1 on the logarithmic scale, indicating practically linear complexity of our algorithm.

(Footnote: LaTTe's overall performance on SPECJVM98 is 2.5 times higher than the SUN JDK 1.1.6
JIT compiler and similar to JDK 1.2 beta4 (HotSpot), without exploiting adaptive compilation [1].)
Figure 7: Linear translation time of VLaTTe.
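The slope-1 reading of a log-log plot can be checked numerically: a least-squares fit of log(time) against log(size) for linear-time data yields a slope close to 1. The data below are synthetic (the real measurements come from Figure 7):

```python
import math
import random

# Synthetic roughly-linear translation times: time ~ 0.01 * size, with noise.
random.seed(0)
sizes = [2 ** k for k in range(4, 12)]
times = [0.01 * n * random.uniform(0.9, 1.1) for n in sizes]

# Least-squares slope on the log-log data.
xs = [math.log(n) for n in sizes]
ys = [math.log(t) for t in times]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 2))  # close to 1.0 for linear-time data
```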
5 Summary
In this paper, we have described the design and implementation of VLaTTe, a Java JIT compiler
with fast and efficient scheduling and register allocation for VLIW machines. Our linear-scan
algorithm, which integrates scheduling and register allocation with a local lookahead, is an
engineering solution that trades off the quality of scheduling/allocation against its speed,
tuned just enough for Java JIT compilation. This is confirmed by our preliminary experimental
results on translation overhead and IPC, which indicate that VLaTTe is promising and would lead
to a more competitive JIT compiler when combined with other optimizations, including OO
optimizations and adaptive compilation, which are left as future work.
References
[1] B. Yang et al. LaTTe: A Java VM just-in-time compiler with fast and efficient register allocation. Submitted for publication, 1999.
[2] S. Sinnhal and B. Nguyen. The Java Factor. Communications of the ACM, 41(6):34-47, June 1998.
[3] F. Yellin and T. Lindholm. The Java Virtual Machine Specification. Addison-Wesley, 1996.
[4] A. Tucker. Coloring a Family of Circular Arcs. SIAM Journal of Applied Mathematics, 29(3):493-502, Nov. 1975.
[5] K. Ebcioglu. Some Design Ideas for a VLIW Architecture for Sequential Natured Software. In M. Cosnard et al., editors, Parallel Processing (Proceedings of IFIP WG 10.3 Working Conference on Parallel Processing, Pisa, Italy), pages 3-21. North Holland, April 1988.
[6] R. Morgan. Building an Optimizing Compiler. Digital Press, 1998.
[7] K. Ebcioglu and E. Altman. DAISY: Dynamic compilation for 100% architectural compatibility. In Proceedings of ISCA '97, 1997.
[8] S. Mahlke et al. Sentinel scheduling for VLIW and superscalar processors. In Proceedings of ASPLOS-5, Oct. 1992.
[9] R. F. Cmelik and D. Keppel. Shade: A Fast Instruction Set Simulator for Execution Profiling. Technical Report TR-93-12, Sun Microsystems, July 1993.
[10] B.-S. Yang, J. Lee, J. Park, S.-M. Moon, and K. Ebcioglu. Lightweight Monitor in Java Virtual Machine. In the 3rd Workshop on Interaction between Compilers and Computer Architectures, Oct. 1998.
Speeding-up Multiprocessors Running DSS Workloads through Coherence Protocols

Pierfrancesco Foglia, Roberto Giorgi, Cosimo Antonio Prete
Dipartimento di Ingegneria dell'Informazione, Facolta' di Ingegneria, Universita' di Pisa, Via Diotisalvi, 2 - 56126 PISA (Italy)
[email protected], prete,[email protected]
Abstract
In this work, we analyze how a DSS (Decision Support System) workload can be accelerated on a shared-bus
shared-memory multiprocessor by adding simple support to the classical MESI coherence protocol. The
DSS workload has been set up using the TPC-D benchmark on the PostgreSQL DBMS. The analysis has been performed via
trace-driven simulation, and operating system effects are also considered in our evaluation.
We analyzed a basic four-processor and a high-end sixteen-processor machine, implementing MESI and two coherence
protocols that deal with the migration of processes and data: PSCR and AMSD. Results show that, even in the four-processor
case, for a DSS workload the use of a write-update protocol with a selective invalidation strategy for private data improves
performance (and scalability) with respect to a classical MESI-based solution, because of the access pattern to shared data
and the lower bus utilization due to the absence of invalidation misses once the contribution of passive sharing is eliminated.
In the 16-processor case, and especially in situations where the scheduler cannot satisfy affinity requirements, the gain
becomes more important: the advantage of a write-update protocol with a selective invalidation strategy for private data, in
terms of execution time, can be quantified at about 20% relative to the other evaluated protocols. This advantage is about
50% in the case of high cache-to-cache transfer latency.
Keywords: Shared-Bus Multiprocessors, DSS system, Cache Memory, Coherence Protocol, Sharing Analysis, Sharing
Overhead.
1 Introduction
An ever-increasing number of multiprocessor server systems shipped today run commercial workloads [50]. These
workloads include database applications such as online transaction processing (OLTP) and decision support systems (DSS),
file servers, and application servers [34]. Nevertheless, technical workloads were widely used to drive the design of current
multiprocessor systems [34], [10], [50], and several studies have shown that commercial workloads exhibit different
behavior from technical ones [20], [25].
The simplest design for a multiprocessor system is a shared-bus shared-memory architecture [51]. In shared-bus systems,
processors access the shared memory through a shared bus. The bus is the bottleneck of the system, since it can easily reach
a saturation condition, thus limiting the performance and the scalability of the machine. The classical solution to overcome
this problem is the use of per-processor cache memories [22]. Cache memories introduce the coherency problem and the
need for adopting adequate coherence protocols [37], [38]. The main coherence protocol classes are Write-Update (WU) and
Write-Invalidate (WI) [37]. WU protocols update the remote copies on each write involving a shared copy, whereas WI
protocols invalidate remote copies in order to avoid updating them. Coherence protocols generate several bus transactions,
thus accounting for a non-negligible overhead in the system (coherence overhead). Coherence overhead may have a negative
effect on the performance and, together with the accesses pattern to application data, determines the best protocol choice for
a given workload [55], [56], [19]. Different optimizations to minimize coherence overhead have been proposed [12], [37],
[42], [18], also acting at compile time and architectural level (as the adoption of adequate coherence protocols [39], [18]).
When the performance achieved by a shared-bus shared-memory multiprocessor is not sufficient, as is typical for
DBMS applications, an SMP (Symmetric Multi-Processing, which includes shared-bus shared-memory architectures) or a
NUMA (Non-Uniform Memory Access) approach can be adopted [51], [22]. In the first case, a crossbar switch interconnects
the processing elements; in the second, an interconnection network does. Such solutions increase the communication bandwidth
among elements, thus allowing more CPUs to be added to the system, but at the cost of a more expensive and complex
communication network. In both designs, the basic building block (node) may be a single-processor system or, better, a
shared-bus shared-memory multiprocessor (examples are the HP V-Class and the SGI Origin families of multiprocessors
[52]). In this way, by adding high-performance nodes, we can achieve the desired level of performance with only a small
number of elements, simplifying the crossbar or interconnection network design and lowering the price of the whole system.
Unfortunately, due to limited bus bandwidth, only a small number of CPUs (at most 4 for the Pentium family [32]) may
be included in a single node.
The aim of this paper is to analyze the scalability of shared-bus shared-memory multiprocessors running commercial
workloads, and DSS applications in particular, and to investigate memory-subsystem solutions that can
increase the processing power of such architectures. In this way, we can meet the performance requirements of commercial
applications with a single shared-bus shared-memory machine, or we can adopt higher-performance nodes in an SMP or
NUMA design, allowing simpler and cheaper designs of switch or interconnection networks.
Decision Support Systems are based on DBMSs; they are used to extract management information from and analyze huge
amounts of historical data. A typical DSS activity is performed via a query that looks for ways to increase revenues, such as
an SQL query quantifying the revenue increase that would have resulted from eliminating a given company discount in a
given year [53]. DSS queries are long running, typically scanning large amounts of data [50]. In particular,
such queries are, for the most part, read-only. Consequently, as already observed in [45], [52], coherence activity and
transactions essentially involve DBMS metadata (i.e., data structures that are needed by the DBMS to work rather than to
store the data), and particularly the data structures needed to implement software locks [45], [52]. The type of access pattern to
those data is one-producer/many-consumers. Such a pattern may favor solutions based on the Write-Update family of
coherence protocols, especially when a high number of processes (the 'consumers') concurrently read shared data
[19]. These considerations suggest considering coherence protocols other than the usually adopted MESI
WI protocol [36].
In our evaluation, the DSS workload of the system is generated by running all the TPC-D queries [44] (through the
PostgreSQL [49] DBMS) and several Unix utilities, which both access the file system and interface the DBMS with system
services running concurrently. Our methodology relies on trace-driven simulation, by means of the “Trace Factory”
environment [17], and on specific tools for the analysis of coherence overhead [14].
The performance of the memory subsystem, and therefore of the whole system, depends on cache parameters and
coherence management techniques, and is influenced by operating system activities like process migration, cache-affinity
scheduling, kernel/user code interference, and virtual memory mapping. The importance of considering operating system
activity has been highlighted in previous works [6], [5], [41]. Process migration, needed to achieve load balancing in such
systems, increases the number of cold/conflict misses and also generates useless coherence overhead, known as passive
sharing overhead [18]. As passive sharing overhead dramatically decreases the performance of pure WU protocols [18],
[47], we considered in our analysis a hybrid WU protocol, PSCR [18], which adopts a selective invalidation strategy for
private data, and a hybrid WI protocol, AMSD [8], [33], which deals with the migration of data and, hence, of processes.
Our results show that, in the case of a 4-processor configuration, the performance of the DSS workload is moderately
influenced by cache parameters, and the influence of the coherence protocol is minimal. In 16-processor configurations, the
performance differences due to the adoption of different architectural solutions are significant. In high-end architectures,
MESI is not the best choice, and workload changes can nullify the action of affinity-based scheduling algorithms.
Architectures based on a write-update protocol with a selective invalidation strategy for private data outperform those based
on MESI by about 20%, and adapt better to workload changes. We will also show solutions that are less sensitive to
cache-to-cache transfer latency.
The rest of the paper is organized as follows. In Section 2, we discuss architectural parameters and issues related to the
coherence overhead. In Section 3, we report the results of studies related to the analysis of workloads similar to ours,
differentiating our contribution. In Section 4, we present our experimental setup and methodology. In Section 5, we discuss
the results of our experiments. In Section 6, we draw the conclusions.
2 Issues Affecting Coherence Overhead
Maintaining cache coherency involves a number of bus operations. Some of them are overhead that adds to the basic
bus traffic (the traffic necessary to access main memory). Three different sources of sharing may be observed: i) true sharing
[40], [42], which occurs when the same cached data item is referenced by different processes concurrently running on
different processors; ii) false sharing [40], [42], which occurs when several processors reference a different data item
belonging to the same memory block separately; iii) passive [30] or process-migration [1] sharing, which occurs when a
memory block, though belonging to a private area of a process, is replicated in more than one cache, as a consequence of the
migration of the owner process.
The main coherence protocol classes are Write-Update (WU) and Write-Invalidate (WI) [37]. WU protocols update the
remote copies on each write involving a shared copy, whereas WI protocols invalidate remote copies in order to avoid
updating them. An optimal selection of the coherence protocol can be made by considering the traffic induced by the two
approaches under different kinds of sharing and their effects on performance [55]. The coherence overhead induced by a WU
protocol is due to all the operations needed to update the remote copies. A WI protocol, instead, invalidates remote copies,
and processors generate a miss on the next access to an invalidated copy (invalidation miss). The access patterns to shared data
determine the coherence overhead. In particular, fine-grain sharing denotes high contention for shared data, while sequential
sharing is characterized by long sequences of writes to the same memory item performed by the same processor. A WI
protocol is adequate in the case of sequential sharing, whilst, in general, a WU protocol performs better than WI for programs
characterized by fine-grain sharing [56], [19], [16] (alternatively, a metric based on the write-run and external rereads model
can be used to estimate the best protocol choice [2]). In our evaluation, we considered the MESI protocol as the baseline, because it
is used in most of the high performance processors today (like AMD K5 and K6, PowerPC series, SUN UltraSparc II, SGI
R10000, Intel Pentium, Pentium Pro series (Pro, II, III), Pentium 4, and IA-64/Itanium). Then, based on the previous
considerations on the access patterns of DSS workloads, we included a selective invalidation protocol based on a WU policy
for managing shared data, which limits most effects of process migration (PSCR [18]), and a WI protocol specifically designed
to treat data migration (AMSD [8], [33]).
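The WU-versus-WI trade-off described above can be illustrated with a toy single-block bus-traffic model. This is an illustrative sketch with made-up access traces, not the simulator used in the paper:

```python
def coherence_transactions(trace, protocol):
    """Count bus transactions for one memory block under 'WU' or 'WI'.

    Deliberate simplifications: WU write misses are not charged, and a WI
    write costs one transaction whether it fetches or only invalidates.
    """
    holders = set()   # CPUs currently holding a valid copy of the block
    bus = 0
    for cpu, op in trace:
        if op == 'r':
            if cpu not in holders:
                bus += 1                 # read miss (incl. invalidation miss)
                holders.add(cpu)
        elif protocol == 'WU':
            if holders - {cpu}:
                bus += 1                 # broadcast update to remote copies
            holders.add(cpu)
        else:                            # 'WI'
            if cpu not in holders or holders - {cpu}:
                bus += 1                 # invalidate (and fetch if needed)
            holders = {cpu}
    return bus

# One producer, two consumers repeatedly reading (fine-grain sharing):
fine = [(0, 'w'), (1, 'r'), (2, 'r'), (0, 'w'), (1, 'r'), (2, 'r')]
# One stale remote copy, then a long private write run (sequential sharing):
seq = [(1, 'r')] + [(0, 'w')] * 6

print(coherence_transactions(fine, 'WU'), coherence_transactions(fine, 'WI'))  # 3 6
print(coherence_transactions(seq, 'WU'), coherence_transactions(seq, 'WI'))    # 7 2
```

Even in this toy model, WU wins under fine-grain sharing (updates are cheaper than invalidation misses), while WI wins under sequential sharing (one invalidation silences the whole write run).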
Our implementation of MESI uses the classical MESI protocol states [36] and the following bus transactions: read-block
(to fetch a block), read-and-invalidate-block (to fetch a block and invalidate any copies in other caches), invalidate (to
invalidate any copy in other caches), and update-block (to write back dirty copies, when they need to be replaced). This is
mostly similar to the implementation of MESI in the Pentium Pro (Pentium II) processor family [32]. The invalidation
transaction used to obtain coherency has, as a drawback, the need to reload a copy if a remote processor uses it
again, thus generating a miss (invalidation miss). Therefore, MESI coherence overhead (that is, the transactions needed
to enforce coherence) is due both to invalidate transactions and to invalidation misses.
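The transitions just described can be sketched as a minimal per-cache MESI model. This is an illustrative sketch using the transaction names listed above, not the simulator's implementation:

```python
def local_access(state, op, others_have_copy):
    """Next state and bus transaction for a local read ('r') or write ('w')."""
    if op == 'r':
        if state == 'I':
            return ('S' if others_have_copy else 'E'), 'read-block'
        return state, None                       # read hit: no transaction
    if state == 'I':
        return 'M', 'read-and-invalidate-block'  # write miss
    if state == 'S':
        return 'M', 'invalidate'                 # upgrade, kill other copies
    return 'M', None                             # hit in M or E: silent

def snoop(state, remote_op):
    """Next state when a remote read/write for the block is seen on the bus."""
    if remote_op == 'r':
        # M supplies data via update-block (write-back) and drops to S.
        return 'S' if state in ('M', 'E') else state
    return 'I'                                   # remote write invalidates

print(local_access('I', 'w', False))  # ('M', 'read-and-invalidate-block')
```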
PSCR (Passive Shared Copy Removal) adopts a selective invalidation scheme for private data, and uses a Write-Update
scheme for shared data, although it is not a purely Write-Update protocol. A cached copy belonging to a process private area
is invalidated locally as soon as another processor fetches the same block [18]. This technique reduces coherence overhead
(especially passive sharing overhead), otherwise dominant in a pure WU protocol [30]. Invalidate transactions are eliminated
and coherence overhead is due to Write Transactions and Invalidation Misses caused by local invalidation (Passive Sharing
Misses).
AMSD [8], [33] is designed for Migratory Sharing, which happens when the control over shared data migrates from one
process to another running on a different processor. The protocol identifies migratory-shared data dynamically to reduce the
cost of moving them. The implementation relies on an extension of a common MESI protocol. Though designed for
migratory sharing, AMSD may have some beneficial effects also on passive sharing. AMSD coherence overhead is due to
Invalidate Transactions and Invalidation Misses.
The process scheduling strategy, cache parameters, and the bus features also influence the coherence overhead and,
therefore, the multiprocessor performance. The process scheduler guarantees load balancing among the processors by
scheduling a ready process on the first available processor. Although cache affinity is used, a process may migrate to a
different processor rather than on the last used processor, producing at least two effects: i) some misses when it restarts on a
new processor (context-switch misses); ii) useless coherence transactions due to passive sharing. These situations may
become frequent due to the dynamic nature of workloads driven by user requests, like DSS ones. Process migration also
influences the access patterns to data and, therefore, the coherence overhead.
All the factors described above have important consequences on the global performance, which we shall evaluate in
Section 5.
3 Related Work on DSS Systems
In this section, we summarize the main results of evaluations of DSS and similar workloads on several
multiprocessor architectures and operating systems. The search for a realistic evaluation framework for shared-memory
multiprocessor evaluations [34], [50] motivated many studies that consider benchmarks like the TPC series (including DSS,
OLTP, and WEB-server benchmarks) representative of commercial workloads [45], [3], [4], [31], [28], [27], [25].
Trancoso et al. study the memory access patterns of a TPC-D-based DSS workload [45]. The DBMS is Postgres95
running on a simulated four-processor CC-NUMA multiprocessor. For cache block sizes ranging from 4 to 128 bytes and
cache capacities from 128K-bytes to 8M-bytes, the main results are that both large cache blocks and data prefetching help,
due to the spatial locality of index and sequential queries. Coherence misses can be more than 60% of total misses in queries
that use index-scan algorithms for select operations.
Barroso et al. evaluate an Alpha 21164-based SMP memory system through hardware counter measurements and SimOS
simulations [3]. Their system was running Digital UNIX and the Oracle 7 DBMS. The authors consider a TPC-B database
(OLTP benchmark), TPC-D (DSS queries), and the AltaVista search engine. They found that memory accounts for 75%
of the stall time. In the case of the TPC-D and AltaVista workloads, the size and latency of the on-chip caches mostly influence
the performance. The number of processes per processor, which is usually kept high in order to hide I/O latencies, also
significantly influences cache behavior. When using 2-, 4-, 6-, and 8-processor configurations, they found that coherency
miss stalls increase linearly with the number of processors. Beyond an 8M-byte outer-level cache, they observed that true
sharing misses limit performance.
Cao et al. examine a TPC-D workload executing on a Pentium Pro 4-processor system, with Windows NT and MS SQL
Server [4]. Their goal is to characterize this DSS system on a real machine, in particular, regarding processor parameters, bus
utilization, and sharing. Their methodology is based on hardware counters. They found that kernel time is negligible (less
than 6%). Major sources of processor stalls are instruction fetch and data miss in outer level caches. They found lower miss
rates for data caches in comparison with other studies on TPC-C [3], [25]. This is due to the smaller working set of TPC-D
compared with TPC-C.
Ranganathan et al. consider both an OLTP workload (modeled after TPC-B [43]) and a DSS workload (query 6 of TPC-
D [44]) [31]. Their study is based on trace-driven simulation, where traces are collected on a 4-processor AlphaServer4100
running Digital Unix and Oracle 7 DBMS. The simulated system is a CC-NUMA shared-memory multiprocessor with
6
advanced ILP support. Results show that, on a 4-processor system, an ILP configuration with 4-way issue, a 64-entry
instruction window, and 4 outstanding misses already provides significant benefits for the OLTP and DSS workloads. Such
configurations are even less aggressive than commercial ILP processors like the Alpha 21264, HP PA-8000, and MIPS
R10000 [48]. The latter processor, used in our evaluation, makes us reasonably confident that this processor architecture is a
sound basis for investigating the memory subsystem.
Another performance analysis of OLTP (TPC-B) and DSS (query 6 of TPC-D) workloads via simulation is presented in
[28]. The simulated system is a commercial CC-NUMA multiprocessor, constituted by four-processor SMP nodes connected
using Scalable Coherent Interface based coherent interconnects. The coherence protocol is directory-based. Each node is based
on a Pentium Pro processor with a 512K-byte, 4-way set-associative cache. Results show that the TPC-D miss rate is much
lower and performance is less sensitive to L2 miss latency than in the OLTP (TPC-B) experiments. They also analyze the
scalability exhibited by such workloads. The speed-up of the DSS workload is close to the theoretical one. Results show that
scalability strongly depends on the miss rate.
Lo et al. [27] analyze the performance of database workloads running on simultaneous multithreading processors [13] -
an architectural technique to hide memory and functional-unit latencies. The study is based on trace-driven simulation of a
four-processor AlphaServer 4100 and the Oracle 7 DBMS, as in [3]. They consider both an OLTP workload (modeled after the TPC-B
[43] benchmark) and a DSS workload (query 6 of the TPC-D [44] benchmark). Results show that while DBMS workloads
have large memory footprints, there is substantial data reuse in a small working set.
Summarizing, these studies considered DSS workloads but they were mostly limited to four-processor systems, did not
consider the effects of process migration, and did not correlate the amount of sharing with the performance of the system. As
systems include more processors, it becomes crucial to characterize the memory subsystem further. In our work, we investigated
both 4- and 16- processor configurations, finding that larger caches may have several drawbacks due to coherence overhead.
This is mostly related to the use of shared structures like indices and locks in TPC-D workload. In Section 5, we will classify
the sources of this overhead and propose solutions to overcome limitations to the performance related to process migration.
4 Methodology and Workload
The methodology that we used is based on trace-driven simulation [35], [29], [46] and on the simulation of the three
kernel activities that most affect performance: system calls, process scheduling, and virtual-to-physical address translation.
The approach used is to produce a source trace - a sequence of memory references, system-call positions (and
synchronization events if needed) - by means of a tracing tool. Trace Factory then models the execution of complex
multiprogrammed workloads by combining multiple source traces and simulating system calls (which could also involve I/O
activity), process scheduling and virtual-to-physical address translation. Finally, Trace Factory produces the references
(target trace) that is furnished as input to a memory-hierarchy simulator [29]. Trace Factory generates references according
to an on-demand policy: it produces a new reference when the simulator requests one, so that the timing behavior imposed by
the memory subsystem conditions the reference production [17]. Process management is modeled by simulating a scheduler that
dynamically assigns a ready process. Virtual-to-physical address translation is modeled by mapping sequential virtual pages
into non-sequential physical pages. A careful evaluation of this methodology has been carried out in [17].
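The on-demand interleaving of per-process source traces into one target trace can be sketched as a generator driven by a round-robin "scheduler". This is a simplified illustration of the policy described above, not Trace Factory's actual implementation:

```python
def target_trace(source_traces, quantum=2):
    """Interleave per-process source traces, yielding (pid, reference)
    pairs one at a time, as the memory simulator requests them."""
    iters = {pid: iter(trace) for pid, trace in source_traces.items()}
    ready = list(iters)
    while ready:
        for pid in list(ready):
            for _ in range(quantum):       # run pid for one quantum
                try:
                    yield pid, next(iters[pid])
                except StopIteration:
                    ready.remove(pid)      # process finished its trace
                    break

# Two hypothetical source traces, interleaved two references at a time:
srcs = {0: ['a1', 'a2', 'a3'], 1: ['b1', 'b2']}
print(list(target_trace(srcs)))
# [(0, 'a1'), (0, 'a2'), (1, 'b1'), (1, 'b2'), (0, 'a3')]
```

Because the output is a generator, the consumer (here, a hypothetical memory-hierarchy simulator) pulls references one at a time, so its timing can feed back into which reference is produced next, mirroring the on-demand policy.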
Table 1. Statistics of source traces for some Unix commands and daemons and for multiprocess
source traces (PostgreSQL, TPC-D queries), in the case of 64-byte block size and 10,000,000
references per process.

APPLICATION  NO. OF     DISTINCT  CODE (%)  DATA (%)        SHARED   SHARED DATA (%)
             PROCESSES  BLOCKS              READ    WRITE   BLOCKS   ACCESS   WRITE
awk (beg)    1          4963      76.76     14.76   8.48    n/a      n/a      n/a
awk (mid)    1          3832      76.59     14.48   8.93    n/a      n/a      n/a
cp           1          2615      77.53     13.87   8.60    n/a      n/a      n/a
rm           1          1314      86.39     11.51   2.10    n/a      n/a      n/a
ls -aR       1          2911      80.62     13.84   5.54    n/a      n/a      n/a
telnetd      1          463       82.75     12.96   4.29    n/a      n/a      n/a
crond        1          2464      75.86     16.35   7.79    n/a      n/a      n/a
syslogd      1          2848      80.41     14.96   4.63    n/a      n/a      n/a
TPC-D18      18         139324    73.05     16.89   10.06   7224     1.58     0.43
TPC-D12      12         93657     73.04     16.91   10.05   5662     1.55     0.41
Table 2. Statistics of multiprogrammed target traces (DSS26, 260,000,000 references; DSS18, 180,000,000 references) in case of 64-byte block size.

WORKLOAD  NO. OF     DISTINCT  CODE   DATA     DATA      SHARED  SHARED     SHARED
          PROCESSES  BLOCKS    (%)    READ(%)  WRITE(%)  BLOCKS  ACCESS(%)  WRITE(%)
DSS26     26         179862    74.59  16.26    9.15      7806    1.76       0.53
DSS18     18         124268    74.00  16.11    8.89      6242    1.63       0.48
The workload considered in our evaluation includes DB activity reproduced by means of an SQL server, namely PostgreSQL [49], which handles the TPC-D [44] queries. We also included Unix utilities that access the file system, interface with the various programs running on the system, and reproduce the activity of typical Unix daemons.
PostgreSQL is a public-domain DBMS that relies on a client-server paradigm. It consists of a front-end process that accepts SQL queries, and a back-end that forks the processes which manage the queries. TPC-D is a benchmark for DSS
developed by the Transaction Processing Performance Council [44]. It simulates an application for a wholesale supplier that
manages, sells, and distributes a product worldwide. Following TPC-D specifications, we populated the database via the
dbgen program, with a scale factor of 0.1. The data are organized in several tables and accessed by 17 read-only queries and
2 update queries.
In a typical situation, application and management processes may require the support of different system commands and ordinary applications. To this end, Unix utilities (ls, awk, cp, and rm) and daemons (telnetd, syslogd, crond) have been added to the workload. These utilities are important because they model the 'glue' activity of the system software: i) they do not have shared data, and thus they increase the effects of process migration, as discussed in detail in Section 5; ii) they may interfere with the shared-data and code cache footprints of other applications. To take into account that requests may be using the same program at different times, we traced some commands in shifted execution sections: initial (beg) and middle (mid).
In our experiments, we generated two distinct workloads. The first workload (DSS26 in Table 2) includes the TPC-D18 source trace (Table 1). TPC-D18 is a multiprocess source trace that accounts for the activity of 18 processes. One process is generated by the DBMS back-end execution, whilst the other processes are generated by the concurrent execution of the 17 read-only TPC-D queries (TPC-D18 in Table 2). Since we wanted to explore situations that are critical for affinity scheduling (when the number of processes changes), we also generated a second workload (DSS18 in Table 2) that includes a subset of the TPC-D queries (TPC-D12). Table 1 contains some statistics of the uniprocess and multiprocess source traces used to generate the DB workloads (target traces, Table 2). Both the source traces related to DBMS activity (TPC-D18, TPC-D12) and the resulting workloads (DSS26 and DSS18) present similar characteristics in terms of read, write, and shared accesses.
5 Results

In this section, we show the memory-subsystem performance for our DSS workload, with a detailed characterization of coherence overhead and process-migration problems. To this end, we include results for several values of the most influential cache-architecture parameters. Finally, we consider situations that are critical for affinity scheduling, and we analyze the sensitivity to cache-to-cache latency. Our results show that, for this kind of machine, solutions are possible that achieve a higher scalability than standard solutions, and consequently a greater performance at a reasonable cost.
5.1 Design Space of our System

The simulated system consists of N processors, interconnected through a 128-bit shared bus for accessing shared memory. The following coherence schemes have been considered: AMSD, MESI, and PSCR (more details are in Section 2). We considered two main configurations: a basic machine with 4 processors and a high-performance one with 16 processors. The scheduling policy is based on cache affinity; the scheduler time-slice is 200,000 references. Cache size has been varied between 512K bytes and 2M bytes, while for block size we used 64 bytes and 128 bytes. The simulated processors are MIPS-R10000-like; paging relies on a 4-KByte page size; the bus is pipelined, and supports transaction splitting and a processor-consistency memory model [15]; up to 8 outstanding misses are allowed [26]. The base-case-study timings and parameter values for the simulator are summarized in Table 3.
Table 3. Input parameters for the multiprocessor simulator (timings are in clock cycles)

Class   Parameter                                      Timings
CPU     Read cycle                                     2
        Write cycle                                    2
Cache   Cache size (Bytes)                             512K, 1M, 2M
        Block size (Bytes)                             64, 128
        Associativity (Number of Ways)                 1, 2, 4
Bus     Write transaction (PSCR)                       5
        Write-for-invalidate transaction (AMSD, MESI)  5
        Invalidate transaction (AMSD)                  5
        Memory-to-cache read-block transaction         72 (64-byte block), 80 (128-byte block)
        Cache-to-cache read-block transaction          16 (64-byte block), 24 (128-byte block)
        Update-block transaction                       10 (64-byte block), 18 (128-byte block)
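For use in a simulator, the Table 3 timings can be captured in a small lookup table. The sketch below is our own illustration (the names `TIMINGS` and `bus_cost` are ours, not part of the original simulator); block-size-dependent transactions carry one value per block size:

```python
# Table 3 timings (clock cycles); block-size-dependent transactions
# are stored as {block size in bytes: cycles}.
TIMINGS = {
    "cpu_read": 2,
    "cpu_write": 2,
    "write_transaction": 5,             # PSCR
    "write_for_invalidate": 5,          # AMSD, MESI
    "invalidate_transaction": 5,        # AMSD
    "mem_read_block": {64: 72, 128: 80},
    "c2c_read_block": {64: 16, 128: 24},
    "update_block": {64: 10, 128: 18},
}

def bus_cost(kind, block_size=64):
    """Cycles taken on the bus by one transaction of the given kind."""
    timing = TIMINGS[kind]
    return timing[block_size] if isinstance(timing, dict) else timing
```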
5.2 Performance Metrics

To investigate the effects of memory-subsystem operations on performance, we analyzed the causes influencing memory latency and the total execution time. The memory latency depends on the time necessary to perform a bus operation (Table 3) and on the waiting time to access the bus (bus latency). Bus latency depends on the amount and kind of bus traffic. Bus traffic consists of read-block transactions (issued for each miss), update transactions, and coherence transactions (write transactions or invalidate transactions, depending on the coherence protocol). Consequently, miss and coherence transactions affect performance both because they affect the processor waiting-time directly (in the case of read misses) and because they contribute to bus latency. Therefore, to investigate the sources of performance bottlenecks, we report a breakdown of misses - which includes invalidation misses and classical misses (the sum of cold, capacity, and conflict misses) - and the "number of coherence transactions per 100 memory references" - which includes either write transactions or invalidate transactions, depending on the coherence protocol. The rest of the traffic is due to update transactions. Update transactions are a negligible part of the bus traffic (lower than 7% of read-block transactions, or about 1% of bus occupancy) and thus they do not greatly influence our analysis.
Figure 1. Normalized Execution Time versus cache size (512K, 1M, 2M bytes), number of ways (1, 2, 4), and coherence protocol (AMSD, MESI, PSCR). Data assume 4 processors, 64-byte block size, and a cache-affinity scheduler. Execution Times are normalized with respect to the MESI, 512K, direct-access architecture. PSCR presents the lowest Execution Time, whilst MESI presents the highest.
In our analysis, we differentiated among Cold Misses, Capacity+Conflict Misses, and Invalidation Misses [22]. Cold Misses include first-access misses by a given process when it is scheduled either for the first time or when it is rescheduled on another processor (the latter are also known as context-switch misses). Capacity+Conflict Misses include misses caused by memory references of a process competing for the same block (intrinsic interference misses in [1]), and misses caused by references of sequential processes executing on the same processor and competing for the same cache block (extrinsic interference misses in [1]). Invalidation Misses are due to accesses to data that are going to be reused on the same processor, but that have been invalidated in order to maintain coherence. Invalidation Misses are further classified, along with the coherence-transaction type, by means of an extension of an existing classification algorithm [23]. Our algorithm extends this classification to the case of passive sharing, finite-size caches, and process migration [14]. In particular, Invalidation Misses are differentiated as true sharing misses, false sharing misses, and passive sharing misses. True sharing misses and false sharing misses are classified according to an already known methodology [42], [12], [11]. Passive sharing misses are invalidation misses generated by the useless coherence maintenance of private data [18]. Clearly, these private data can appear as shared to the coherence protocol because of process migration. Similarly, coherence transactions are classified as true, false, and passive sharing transactions, whether they are invalidation or write transactions [14].
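The core of the true/false/passive distinction can be sketched as a small decision tree. This is a deliberately simplified illustration of ours, not the paper's actual classification algorithm (which also handles finite caches and the coherence-transaction type); all names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class InvalidatedLine:
    """State remembered for a cache block that was invalidated earlier."""
    invalidating_cpu_word: int       # word index written by the invalidating store
    shared_by_migration_only: bool   # private data that only migration made 'shared'

def classify_invalidation_miss(line: InvalidatedLine, missed_word: int) -> str:
    """Simplified decision tree in the spirit of the classification used
    in the paper."""
    if line.shared_by_migration_only:
        return "passive"   # useless coherence on private data after migration
    if missed_word == line.invalidating_cpu_word:
        return "true"      # the written word itself is re-read: true sharing
    return "false"         # a different word of the same block: false sharing
```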
Figure 2. Breakdown of miss rate versus cache size (512K, 1M, 2M bytes), number of ways (1, 2, 4), and coherence protocol (AMSD, MESI, PSCR). Data assume 4 processors, affinity scheduling, and 64-byte blocks.
Figure 3. Breakdown of miss rate versus cache size (512K, 1M, 2M bytes), number of ways (1, 2, 4) and coherence protocol
(AMSD, MESI, PSCR). Data assume 4 processors, affinity scheduling and 64-byte block.
5.3 Analysis of the Reference System

We started our analysis from a 4-processor machine, similar to cases used in the literature [45], [3], [4], [31], running the DSS26 workload (Table 2). In detail, the reference 4-processor machine has a 128-bit bus and a 64-byte block size. We varied cache size (from 512K to 2M bytes) and cache associativity (1, 2, 4).

Execution Time (Figure 1) is affected by the cache architecture: it decreases with larger cache sizes and/or higher associativity. However, this variation is limited to 8% between the least performing (512K-byte, one-way) and the most performing (2M-byte, four-way) configurations. In this case, the role of the coherence protocol is less important, with a difference among the various protocols, for a given setup, of less than 1%.
We analyzed the reasons for this performance improvement (Figure 2) by decomposing the miss rate in terms of traditional (cold, conflict, and capacity) misses and invalidation misses. In Figure 3, we show the contribution of each kind of sharing to the Invalidation Miss Rate and, in Figure 4, to the Coherence-Transaction "Rate" (i.e., the number of coherence transactions per 100 memory references). We also differentiated between kernel and user overhead.

As expected, cache size mainly influences capacity misses. Invalidation misses increase slightly when increasing cache size or associativity (Figure 2). For cache sizes larger than 1M bytes, cold misses are the dominant part, and invalidation misses weigh more and more (at least 30% of total misses). This indicates that, for our DSS workload, caches larger than 2M bytes already capture the main working set. In fact, for cache sizes larger than 2M bytes, Cold Misses remain constant, invalidation misses increase, and only Capacity+Conflict misses may decrease, but their contribution to the miss rate is already minimal. The solution based on PSCR presents the lowest miss rate and negligible invalidation misses.
Figure 4. Number of coherence transactions versus cache size (512K, 1M, 2M bytes), number of ways (1, 2, 4), and coherence protocol (AMSD, MESI, PSCR). Coherence transactions are invalidate transactions in MESI and AMSD, and write transactions in PSCR. Data assume 4 processors, 64-byte block size, and an affinity scheduler.
Our analysis of coherence overhead (Figures 3 and 4) confirms that the major sources of overhead are invalidation misses for WI protocols (MESI and AMSD) and write transactions for PSCR. Passive sharing overhead, either as invalidation misses or as coherence transactions, is minimal: this means that affinity scheduling performs well. In the user part, true sharing is dominant. In the kernel part, false sharing is the major portion of the coherence overhead.

The cost of misses dominates the performance, and indeed Figure 1 shows that PSCR achieves the best performance compared with the other protocols. The reason is the following: what PSCR loses in terms of extra coherence traffic, it regains in saved misses. Indeed, misses are more costly in terms of bus timings, and read-block transactions may produce a higher waiting time for the processor. Also, AMSD performs better than MESI, due to the reduction of the Invalidation Miss Rate overhead (Figure 2).

Our conclusions for the 4-processor configuration agree with previous studies as regards the analysis of miss rate and the effects of coherence maintenance [3], [4], [45].
Figure 5. Normalized Execution Time versus cache size (512K, 1M, 2M bytes), number of ways (1, 2, 4), and coherence
protocol (AMSD, MESI, PSCR). Data assume 16 processors, 64-byte block size, and an affinity scheduler. Execution Times are normalized with respect to the Execution Time of the MESI, 512K, direct access configuration.
Figure 6. Breakdown of miss rate versus coherence protocol (AMSD, MESI, PSCR) and number of processors (4, 16). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte two-way set-associative caches. The higher number of processors causes more coherence misses (false plus true sharing) and more capacity and conflict misses. It is interesting to observe that, while the aggregate cache size increases (from 4M to 16M bytes), Capacity and Conflict misses also increase. This situation is different from the uniprocessor case, where Capacity+Conflict misses always decrease when cache size increases. This effect is due to process migration.
5.4 Analysis of the High-End System

In the previous baseline-case analysis, we have seen that the four-processor machine is efficient: architecture variations do not produce further gains (e.g., the performance differences among protocols are small). This kind of machine may not satisfy the performance needs of DSS workloads, and higher-performance systems are demanded. Given current processor-memory speeds, we considered a 'high-end' 16-processor configuration. A detailed analysis of the limitations of this system shows that this architecture makes sense, i.e., a high-end SMP system proves efficient, if we solve the problems produced by process migration and adapt the coherence protocol to the access pattern of the workload. This architecture has not been analyzed in the literature, and it still represents a relatively economical solution to enhance performance.
Figure 7. Breakdown of Invalidation Miss Rate versus coherence protocol (AMSD, MESI, PSCR) and number of processors (4, 16). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte 2-way set-associative caches. Having more processors causes more coherence misses (false and true sharing). Passive sharing misses are slightly higher in PSCR compared to the other two protocols. This is a consequence of the selective invalidation mechanism of PSCR: as soon as a private block is fetched on another processor, PSCR invalidates all the remote copies of that block. In the other two protocols, the invalidation is performed on a write operation on shared data, thus less frequently than in PSCR.
Figure 8. Number of coherence transactions versus coherence protocol (AMSD, MESI, PSCR) and number of processors (4, 16). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte 2-way set-associative caches. There is an increment in the sharing overhead in all of its components. This increment is more evident in the WI-class protocols, also because there is more passive sharing overhead.
In the following, we will compare our results directly with the four-processor case and show the sensitivity to classical cache parameters (5.4.1). We will consider separately the case of a different block size (5.4.2), the case of a different workload pressure in terms of the number of processes (5.4.3), and the case of a different cache-to-cache latency (5.4.4).
5.4.1 Comparison with the 4-processor case and sensitivity to cache size and associativity

In Figure 5, we show the Execution Time for several configurations. Protocols designed to reduce the effects of process migration achieve better performance. In particular, the performance gain of PSCR over MESI is at least 13% in all configurations. The influence of the cache parameters is stronger, with a 30% difference between the most and the least performing configurations.

We clarify the relationship between Execution Time and process migration by showing the Miss Rate (Figure 6) and the Number of Coherence Transactions (Figure 8). In detail, we show the breakdown of the most varying component of the Miss Rate (Invalidation Misses, Figure 7) and the breakdown of coherence transactions (Figure 8). In the following, for the sake of clarity, we assume a 1-Mbyte cache size, a 64-byte block size, and 2-way associativity.
The Execution Time is mainly determined by the cost of read-block transactions and the cost of coherence-maintaining transactions (see also Section 5.2). Let us take MESI as the baseline. AMSD has a lower Execution Time because of its lower Miss Rate (Figure 6, and in particular its Invalidation Miss Rate, Figure 7). The small variation in the Number of Coherence Transactions (Figure 8) does not weigh much on the performance. PSCR gains strongly from the reduction of Miss Rate, while it is not penalized too much by the high Number of Coherence Transactions (Figures 6 and 8). In detail:

• Classical misses (Cold and Conflict+Capacity) do not vary with the coherence protocol, although they increase when switching from 4 to 16 processors because of the higher migration of processes: the Cold Miss portion increases because a process may be scheduled on a higher number of processors, and the Conflict+Capacity portion because a process has fewer opportunities of reusing its working set and it destroys the cache footprint generated by other processes.

• The much lower Miss Rate in PSCR is due to its low Invalidation Miss Rate (Figure 6). In particular, PSCR does not have invalidation misses due to true and false sharing, neither in user nor in kernel mode (Figure 7). On the contrary, in the other two protocols the latter factors weigh very much. This effect is more evident in the 16-processor case as compared to the 4-processor case, because of the higher probability of sharing data produced by the increased number of processors.
• The DSS workload, in the PostgreSQL implementation, exhibits an access pattern to shared data that favors the WU strategy for maintaining coherence on shared data with respect to the WI strategy. We can explain this pattern with the following considerations: the TPC-D DSS queries are read-only and, consequently, as already observed in [45], [52], coherence misses and transactions involve PostgreSQL metadata (i.e., data structures that are needed by the DBMS to work rather than to store the data), and particularly the data structures needed to implement software locks [45], [52]. The access pattern to those data is one-producer/many-consumers. Such a pattern advantages WU protocols, especially when a high number of processes (the 'consumers') concurrently reads shared data. When the number of processors switches from 4 to 16, the number of consumer processes increases, due to the higher number of processes concurrently in execution, and, consequently, true sharing Invalidation Misses and Transactions (in the user part) increase, especially for WI protocols (Figures 7 and 8), thus favoring the WU solution (PSCR). However, we cannot utilize a pure WU protocol, as passive sharing dramatically decreases performance [47].
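The one-producer/many-consumers argument can be quantified with a back-of-the-envelope sketch using the Table 3 timings. This is our own illustration (function names are ours, and the per-write accounting is deliberately simplified):

```python
READ_BLOCK = 72   # memory-to-cache read-block transaction, 64-byte block (Table 3)
INVALIDATE = 5    # invalidate transaction (Table 3)
UPDATE = 10       # update-block transaction (Table 3)

def wi_bus_cycles_per_write(n_consumers):
    """Write-invalidate: the producer's store invalidates all remote
    copies, then each consumer re-reads the block (an invalidation miss
    served by a read-block transaction)."""
    return INVALIDATE + n_consumers * READ_BLOCK

def wu_bus_cycles_per_write(n_consumers):
    """Write-update: a single update transaction refreshes every remote
    copy, so the consumers keep hitting in their caches."""
    return UPDATE
```

Under this simplified model, the WI cost grows linearly with the number of consumers while the WU cost stays flat, which matches the observation that PSCR's advantage widens when going from 4 to 16 processors.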
Figure 9. Normalized Execution Time versus block size (64, 128 bytes) and coherence protocol (AMSD, MESI, PSCR). Data assume an affinity scheduler and 1M-byte 2-way set-associative caches. Execution Times are normalized with respect to the execution time of the MESI, 64-byte-block configuration.
Figure 10. Breakdown of miss rate versus coherence protocol (AMSD, MESI, PSCR) and block size (64, 128 bytes). Data assume an affinity scheduler and 1M-byte 2-way set-associative caches. There is a decrease of the Cold and Capacity+Conflict miss components, and a small decrease of the invalidation-miss component.
5.4.2 Analyzing the effects of a larger block size in the 'High-End' System

We started our analysis from the 64-byte block size for comparison with other studies. When switching from 64 to 128 bytes, PSCR gains further advantages with respect to the other two considered protocols (Figure 9), for two reasons. First, we observe a reduction of the Capacity+Conflict miss component (Figure 10), a small reduction of coherence traffic (Figure 12), and a reduction of the Invalidation Miss Rate (Figure 11). Secondly, in the case of a 64-byte block, the system is in saturation(1) for all configurations of Figure 9. In the case of 128-byte blocks, an architecture based on PSCR is not saturated, and thus we can efficiently use configurations with a higher number of processors. When switching from 64 to 128 bytes, the decrease of the Execution Time is 27% for PSCR, and only 20% for the other protocols. We observe that a block size larger than 128 bytes produces diminishing returns, because the increased cost of the read-block transaction is not compensated by the reduction of the number of misses. A similar result has been obtained for a CC-NUMA machine running a DSS workload [28]. In that work, the number of "Effective Processors" for a 16-processor CC-NUMA system was almost the same as what we obtained for our cheaper shared-bus shared-memory system (figure not shown).

(1) We recall that a shared-bus shared-memory SMP system is in saturation [18] when the performance does not increase by at least a given quantity when we add one processor to the machine.
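The diminishing-returns trade-off can be expressed as a simple comparison of total read-block bus cycles before and after doubling the block size. This sketch is ours; the miss counts used in the test are hypothetical, chosen only to illustrate the two regimes:

```python
def larger_block_pays_off(misses_small, misses_large, cost_small, cost_large):
    """True when the miss reduction obtained with the larger block size
    outweighs its higher per-transaction read-block cost, i.e. when total
    read traffic (misses x cycles per read block) shrinks."""
    return misses_large * cost_large < misses_small * cost_small
```

With the Table 3 costs (72 cycles at 64 bytes, 80 at 128 bytes), a substantial miss reduction makes the larger block a win; when the miss reduction stalls, the higher read-block cost dominates and the larger block no longer pays off.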
Figure 11. Breakdown of Invalidation Miss Rate versus coherence protocol (AMSD, MESI, PSCR) and block size (64, 128 bytes). Data assume an affinity scheduler and 1M-byte 2-way set-associative caches. Passive Sharing Misses decrease when increasing the block size, because the invalidation unit is larger.
Figure 12. Number of coherence transactions versus coherence protocol (AMSD, MESI, PSCR) and block size (64, 128 bytes). Data assume an affinity scheduler and 1M-byte 2-way set-associative caches.
5.4.3 Analyzing the effects of variations in the number of processes of the workload

We considered another scenario, in which the number of processes in the workload may vary and thus the scheduler could fail in applying affinity. Affinity scheduling can fail when the number of ready processes is limited. We defined a new workload (DSS18, Table 2) with characteristics similar to those of the DSS26 workload used in the previous experiments, but consisting of only 18 processes. The machine under study is still the 16-processor one. In such a condition, the scheduler can only choose between at most two ready processes. The measured miss rate and number of coherence transactions (Figures 14, 15, 16) show an interesting behavior. The miss rate, and in particular the Cold and Conflict+Capacity miss rate, increases with respect to the DSS26 workload. This is a consequence of process migration, and it is determined by the failure of the affinity requirement: as the number of processes is almost equal to the number of processors, it is not always possible for the system to reschedule a process on the processor where it last executed. In such cases, PSCR can greatly reduce the associated overhead, and it achieves the best performance (Figure 13).
The main conclusion here is that PSCR maintains its advantage also under different load conditions, while the other protocols are more penalized by critical scheduling conditions.
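Why affinity fails when the ready queue is short can be seen in a minimal sketch of an affinity choice (our own illustration; the function and data structures are hypothetical, not the simulated scheduler's actual code):

```python
def pick_process(ready, cpu, last_cpu):
    """Cache-affinity choice: prefer a ready process whose last run was on
    this CPU; otherwise fall back to the head of the ready queue, which
    forces a migration. With a short ready queue (the DSS18 case: at most
    two ready processes on 16 processors) the fallback fires often, so
    affinity fails and migration-induced misses grow."""
    for pid in ready:
        if last_cpu.get(pid) == cpu:
            return pid
    return ready[0]
```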
Figure 13. Normalized Execution Time versus coherence protocol (AMSD, MESI, PSCR) and workload (DSS26 – DSS18). Data assume an affinity scheduler, 64-byte block, 1M-byte 2-way set associative caches.
Figure 14. Breakdown of Miss Rate versus coherence protocol (AMSD, MESI, PSCR) and workload (DSS26, DSS18). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte 2-way set-associative caches. The DSS18 workload exhibits the higher miss rate, due to an increased number of Cold, Capacity, and Conflict Misses. This is a consequence of process migration, which affinity fails to mitigate.
Figure 15. Breakdown of Invalidation Miss Rate versus coherence protocol (AMSD, MESI, PSCR) and workload (DSS26, DSS18). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte 2-way set-associative caches. Passive sharing misses decrease, while true and false sharing misses increase. This is a consequence of the failure of affinity scheduling: in the DSS18 execution, processes migrate more than in the DSS26 execution, thus generating more reuse of shared data (and more invalidation misses on shared data) but a lower reuse of private data (and a lower number of passive sharing misses).
Figure 16. Breakdown of Coherence Transactions versus coherence protocol (AMSD, MESI, PSCR) and workload (DSS26, DSS18). Data assume an affinity scheduler, 64-byte blocks, and 1M-byte 2-way set-associative caches.
Figure 17. Normalized Execution Time versus coherence protocol (AMSD, MESI, PSCR) and the ratio between cache-to-cache transfer latency and read-block latency. Data assume an affinity scheduler, 64-byte block, 1M-byte 2-way set
associative caches. Execution Times are normalized with respect to the execution time of the MESI, 1/16 configuration.
5.4.4 Analyzing the effects of variations in the cache-to-cache transfer latency

In our implementation of the coherence protocols, we assume that, when data missing in a local cache are present in a remote cache, that cache supplies the data, because the transfer between caches is usually faster than the transfer between cache and memory (see Table 3). This is compatible with the early designs of bus-based systems, and it derives from the assumption that a cache can supply data more quickly than memory. This is especially true in the case of small-scale SMP architectures [9]. However, this advantage is not necessarily present in modern high-end SMP architectures [9], such as the Sun series based on the Wildfire [21] or the Fireplane bus [7], in which the cache-to-cache latency is similar to the latency of accessing main memory. As previous studies of commercial workloads have highlighted that a large portion of misses can be served by cache-to-cache transfers [25], [31], we analyze the effect of different cache-to-cache latencies on the Execution Time. Figure 17 shows the Execution Time for the three protocols when the cache-to-cache transfer latency is varied. The cases analyzed in the previous sections assumed a ratio of 1/16 between cache-to-cache latency and read-block latency.
Obviously, when we increase the latency, the Execution Time increases. However, the architecture based on PSCR shows less sensitivity to this parameter. This is a consequence of the lower number of cache-to-cache transfers needed in PSCR with respect to the other protocols. In fact, as a benefit of the WU strategy, shared data are never invalidated in PSCR. This also generates several updated copies in several caches: a migrating process does not need to reload them from a remote cache or from memory. When the cache-to-cache latency is increased up to the read-block latency, the advantage of PSCR is about 50% with respect to the other two coherence protocols.
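The sensitivity argument can be made concrete with a back-of-the-envelope model (a sketch with invented transfer counts, not the measured data of Figure 17): the more misses a protocol services by cache-to-cache transfer, the more its total miss-service time grows as that latency approaches the read-block latency.

```python
def miss_service_time(n_c2c, n_memory, c2c_latency, read_block_latency):
    """Total cycles spent servicing misses, split between
    cache-to-cache transfers and memory (read-block) transfers."""
    return n_c2c * c2c_latency + n_memory * read_block_latency

READ_BLOCK = 64  # hypothetical read-block latency in cycles

def sensitivity(n_c2c, n_memory):
    """Relative growth of miss-service time when the cache-to-cache /
    read-block latency ratio goes from 1/16 up to 1."""
    cheap = miss_service_time(n_c2c, n_memory, READ_BLOCK / 16, READ_BLOCK)
    costly = miss_service_time(n_c2c, n_memory, READ_BLOCK, READ_BLOCK)
    return costly / cheap

# A protocol resolving most misses cache-to-cache (hypothetical counts,
# MESI-like) degrades far more than one resolving few (PSCR-like):
assert sensitivity(n_c2c=800, n_memory=200) > sensitivity(n_c2c=200, n_memory=800)
```

The exact numbers are arbitrary; the point is only that the slope of the Execution Time curve in Figure 17 is governed by the fraction of misses served cache-to-cache.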
6 Conclusions

We evaluated the memory performance of a shared-bus shared-memory multiprocessor running a DSS workload, by considering several different choices that could improve the overall performance of the system. We considered architectures based on the following coherence protocols: MESI, a pure WI protocol widely used in high-performance multiprocessors; AMSD, a WI protocol designed to reduce the effects of data migration; and PSCR, a coherence protocol using a hybrid strategy (WU for shared data and WI for private data) designed to reduce the effect of process migration. The DSS workload was set up using the PostgreSQL DBMS executing queries of the TPC-D benchmark, together with typical Unix shell commands and daemons. We considered the kernel effects that are most relevant to our analysis, such as process scheduling, virtual memory mapping, and user/kernel code interactions.
Our conclusions for the 4-processor case agree with previous studies as regards the analysis of miss rate and the effects of coherence maintenance. Our analysis also shows that: i) cache sizes larger than 2M bytes already capture the working set of such workloads; ii) kernel effects account for 50% of the coherence overhead. Previous studies that considered DSS workloads were mostly limited to 4-processor systems, did not consider the effects of process migration, and did not correlate the amount of sharing to the performance of the system.
Our analysis of a "high-end" machine considered a 16-processor SMP. We analyzed variations of classical cache parameters, variations in the workload pressure on the scheduler due to a different number of processes, and variations of the cache-to-cache latency. We found that, in high-end systems, some factors that were less noticeable in the 4-processor case become more evident.
The MESI protocol is not the best choice in high-end SMP architectures: AMSD improves the performance of a DSS system by about 10% compared to MESI, and PSCR by about 20% compared to MESI, when we use a low cache-to-cache latency. DSS workloads running on SMP architectures generate a variable load, and the affinity scheduler may fail to satisfy the affinity requirements. The use of PSCR allows us to build systems whose performance is less influenced by the load conditions. The use of adequate protocols can give a DSS system an advantage in the case of SMP systems with high cache-to-cache latencies (up to 50% improvement in our case).
References

[1] A. Agarwal, Analysis of Cache Performance for Operating Systems and Multiprogramming. Kluwer Academic Publishers, Norwell, MA, 1989.
[2] S. J. Eggers,“Simplicity Versus Accuracy in a Model of Cache Coherency Overhead”. IEEE Transactions on Computers, Vol. 40, No. 8, pp. 893-906, August 1991.
[3] L.A. Barroso, K. Gharachorloo, and E. Bugnion, “Memory System Characterization of Commercial Workloads”. Proc. of 25th Int.l Symp. on Computer Architecture, Barcelona, Spain, pp. 3-14, June 1998.
[4] Q. Cao, J. Torrellas, P. Trancoso, J. L. Larriba-Pey, B. Knighten, and Y. Won, “Detailed characterization of a quad Pentium Pro server running TPC-D”. Proc. of the Int.l Conf. on Computer Design, pp.108-115, Oct. 1999.
[5] R. Chandra, S. Devine, B. Verghese, A. Gupta, and M. Rosenblum, “Scheduling and Page Migration for Multiprocessor Compute Servers”. Proc. of the 6th ASPLOS, pp. 12-24, Oct. 1994.
[6] J. Chapin, S. Herrod, M. Rosenblum, and A. Gupta “Memory System Performance of UNIX on CC-NUMA Multiprocessors”. ACM Sigmetrics Conf. on Measurement and Modeling of Computer Systems, pp. 1-13, May 1995.
[7] A. Charlesworth, “The Sun Fireplan System Interconnect”. Proc. of the Conf. on High Performance Networking and Computing (SC01), November 2001, http://www.sc2001.org/papers/pap.pap150.pdf.
[8] A. L. Cox and R. J. Fowler, “Adaptive Cache Coherency for Detecting Migratory Shared Data”. Proc. 20th Int.l Symp. on Computer Architecture, San Diego, California, pp. 98-108, May 1993.
[9] D. E. Culler and J. P. Singh, Parallel Computer Architecture: a Hardware/Software Approach. Morgan Kaufmann Publishers, San Francisco, CA, 1998.
[10] Z. Cvetanovic and D. Bhandarkar, “Characterization of Alpha AXP Performance Using TP and SPEC Workloads”. Proc. 21st Int.l Symp. on Computer Architecture, pp. 60-70, Apr. 1994.
[11] M. Dubois, J. Skeppstedt, L. Ricciulli, K. Ramamurthy, and P. Stenström, “The Detection and Elimination of Useless Miss in Multiprocessor”. Proc. of 20th Int.l Symp. on Computer Architecture, San Diego, CA, pp. 88-97, May 1993.
[12] S. J. Eggers, T. E. Jeremiassen, “Eliminating False Sharing”. Proc. 1991 Int.l Conf. on Parallel Processing, August 1991, pp. I:377-381.
[13] S.J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm, and Dean M. Tullsen, “Simultaneous Multithreading: A Platform for Next-Generation Processors”. IEEE Micro, pp. 12-19, vol. 17, no. 5, Oct. 1997.
[14] P. Foglia, “An Algorithm for the Classification of Coherence Related Overhead in Shared-Bus Shared-Memory Multiprocessors”. IEEE TCCA Newsletter, pp. 40-46, Jan. 2001.
[15] K. Gharachorloo, A. Gupta, and J. Hennessy, “Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors”. Proc. of the Fourth ASPLOS, Santa Clara, California, pp. 245-357, Apr. 1991.
[16] S. J. Eggers, R. H. Katz, “Evaluating the performance of four snooping cache coherency protocols “. Proc. of 16th Annual International Symposium on Computer Architecture, pp. 2-15, Jerusalem, Israel, 1989.
[17] R. Giorgi, C. Prete, G. Prina, and L. Ricciardi, “Trace Factory: a Workload Generation Environment for Trace-Driven Simulation of Shared-Bus Multiprocessor”. IEEE Concurrency, vol. 5, no. 4, pp. 54-68, October – Dec. 1997.
[18] R. Giorgi and C.A. Prete, “PSCR: A Coherence Protocol for Eliminating Passive Sharing in Shared-Bus Shared-Memory Multiprocessors”. IEEE Trans. on Parallel and Distributed Systems, pp. 742-763, vol. 10, no. 7, July 1999.
[19] S. J. Eggers, R. H. Katz, “A Characterization of Sharing in Parallel Programs and Its Application to Coherency Protocol Evaluation”. Proc. of 15th Annual Int. Symp. on Computer Architecture, pp. 373-382, Honolulu, HI, May 1988.
[20] A. M. Grizzaffi Maynard, C. M. Donnelly, and B. R. Olszewski, “Contrasting characteristics and cache performance of technical and multi-user commercial workloads”. Proc. of the 6th ASPLOS, pp. 158-170, Oct. 1994.
[21] E. Hagersten and M. Koster, “WildFire: A Scalable Path for SMPs”. Proc. of the 5th Symp. on High-Performance Computer Architecture, pp. 172-181, Jan. 1999.
[22] J. Hennessy and D.A. Patterson, Computer Architecture: a Quantitative Approach, 3rd edition. Morgan Kaufmann Publishers, San Francisco, CA, 2002.
[23] R. L. Hyde and B. D. Fleisch, “An Analysis of Degenerate Sharing and False Coherence”. Journal of Parallel and Distributed Computing, vol. 34(2), pp. 183-195, May 1996.
[24] T. E. Jeremiassen, S. J. Eggers, “Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations”. ACM SIGPLAN Notices, 30 (8), pp. 179-188, Aug. 1995.
[25] K. Keeton, D. Patterson, Y. He, R. Raphael, and W. Baker, “Performance characterization of a quad Pentium Pro SMP using OLTP workloads”. Proc. of the 25th Int.l Symp. on Computer Architecture, pp. 15-26, June 1998.
[26] D. Kroft, “Lockup-free Instruction Fetch/Prefetch Cache Organization”. Proc. of the 8th Int.l Symp. on Computer Architecture, pp. 81-87, June 1981.
[27] Jack L. Lo, Luiz A. Barroso, Susan J. Eggers, Kourosh G. Gharachorloo, Henry M. Levy, Sujay S. Pareck, “An analysis of Database Workload Performance on Simultaneous Multithreaded Processors”. Proc. of the 25th Annual Int.l Symp. on Computer Architecture, pp. 39-50, Barcelona, Spain, June 1998.
[28] T. Lovett, R. Clapp, “StiNG: A CC-NUMA Computer System for the Commercial MarketPlace”. Proc. of the 23rd Int.l Symp. on Computer Architecture, pp. 308-317, May 1996.
[29] C.A. Prete, G. Prina, and L. Ricciardi, “A Trace Driven Simulator for Performance Evaluation of Cache-Based Multiprocessor System”. IEEE Transactions on Parallel and Distributed Systems, vol. 6 (9), pp. 915-929, Sept. 1995.
[30] C. A. Prete, G. Prina, R. Giorgi, and L. Ricciardi, “Some Considerations About Passive Sharing in Shared-Memory Multiprocessors”. IEEE TCCA Newsletter, pp. 34-40, Mar. 1997.
[31] P. Ranganathan, K. Gharachorloo, S. V. Adve, and L. Barroso, “Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors”. Proc. of the 8th ASPLOS, pp. 307-318, San Jose, CA, 1998.
[32] T. Shanley and Mindshare Inc., Pentium Pro and Pentium II System Architecture. Addison Wesley, Reading, MA, 1999.
[33] P. Stenström, M. Brorsson, and L. Sandberg, “An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing”. Proc. of the 20th Annual Int.l Symp. on Computer Architecture, May 1993.
[34] P. Stenström, E. Hagersten, D.J. Li, M. Martonosi, and M. Venugopal, “Trends in Shared Memory Multiprocessing”. IEEE Computer, vol. 30, no. 12, pp. 44-50, Dec. 1997.
[35] B. Stunkel, B. Janssens, K. Fuchs, “Address Tracing for Parallel Machines”. IEEE Computer, vol. 24 (1), pp. 31-45, Jan. 1991.
[36] P. Sweazey and A. J. Smith, “A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus”. Proc. of the 13th Int.l Symp. on Computer Architecture, pp. 414-423, June 1986.
[37] M. Tomasevic and V. Milutinovic, “The Cache Coherence Problem in Shared-Memory Multiprocessors –Hardware Solutions”. IEEE Computer Society Press, Los Alamitos, CA, Apr. 1993.
[38] M. Tomasevic and V. Milutinovic, “Hardware Approaches to Cache Coherence in Shared-Memory Multiprocessors”. IEEE Micro, vol. 14, no. 5, pp. 52-59, Oct. 1994 and vol. 14, no. 6, pp. 61-66, Dec.1994.
[39] M. Tomasevic and V. Milutinovic, “The Word-Invalidate Cache Coherence Protocol”. Microprocessors and Microsystems, pp. 3-16, vol. 20, Mar. 1996.
[40] J. Torrellas, M. S. Lam, and J. L. Hennessy, “Share Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates”. Proc. of 1990 Int.l Conf. on Parallel Processing, Urbana, IL, pp. 266-270, Aug. 1990.
[41] J. Torrellas, A. Gupta, and J. Hennessy, “Characterizing the caching and synchronization performance of a multiprocessor operating system”. Proc. 5th ASPLOS, pp. 162-174, Sept. 1992.
[42] J. Torrellas, M. S. Lam, and J.L. Hennessy, “False Sharing and Spatial Locality in Multiprocessor Caches”. IEEE Transactions on Computer, vol. 43, n. 6, pp. 651-663, June 1994.
[43] Transaction Processing Performance Council, “TPC Benchmark B (Online Transaction Processing) Standard Specification”. June 1994. http://www.tpc.org.
[44] Transaction Processing Performance Council, “TPC Benchmark D (Decision Support) Standard Specification”. December 1995. http://www.tpc.org.
[45] P. Trancoso, J. L. Larriba-Pey, Z. Zhang, and J. Torrellas, “The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors”. Proc. of the 3rd Int.l Symp. on High Performance Computer Architecture, pp. 250-260, Los Alamitos, CA, Feb. 1997.
[46] R. Uhlig, T. Mudge, “Trace-Driven Memory Simulation: a Survey”. ACM Computing Surveys, pp. 128-170, June 1997.
[47] C. A. Prete, G. Prina, R. Giorgi, L. Ricciardi, “Some Considerations About Passive Sharing in Shared-Memory Multiprocessors”. IEEE TCCA Newsletter, March 1997.
[48] K. C. Yeager, “The MIPS R10000 Superscalar Microprocessor”. IEEE Micro, vol. 16, n. 4, pp. 42-50, Aug. 1996.
[49] A. Yu and J. Chen, “The POSTGRES95 User Manual”. Computer Science Div., Dept. of EECS, UCB, July 1995.
[50] K. Keeton, R. Clapp, A. Nanda, “Evaluating Servers with Commercial Workloads”. IEEE Computer, Vol. 36, N. 2, pp. 29-32, Feb. 2003.
[51] A. S. Tanenbaum, Structured Computer Organization, 4th edition. Prentice-Hall, Inc., 2001.
[52] R. Yu, L. Bhuyan, R. Iyer, “Comparing the Memory System Performance of DSS Workloads on the HP V-Class and SGI Origin 2000”. Proc. of the Int. Parallel and Distributed Processing Symposium, Fort Lauderdale, FL, April 2002.
[53] C. Ballinger, “Relevance of the TPC-D Benchmark Queries: The Questions You Ask Every Day”. NCR Parallel Systems, http://www.tpc.org/information/other/articles/TPCDart_0197.asp.
[54] A. Ramírez, J.L. Larriba, C. Navarro, X. Serrano, J. Torrellas, and M. Valero, “Code Reordering of Decision Support Systems for Optimized Instruction Fetch”. IEEE International Conference on Parallel Processing, Japan, Sept. 1999.
[55] P. Foglia, R. Giorgi, C.A. Prete, “Analysis of Sharing Overhead in Shared Memory Multiprocessors”. 31st IEEE Hawaii Int. Conf. on Systems, Vol. 7, pp. 776-778, Kohala Coast, HI, January 1998.
[56] S. J. Eggers, R. H. Katz, “The Effect of Sharing on the Cache and Bus Performance of Parallel Programs”. Proc. 3rd ASPLOS, pp. 257-270, Boston, MA, April 1989.
Toward Optimization of OpenMP Codes for Synchronization
and Data Reuse
Tien-hsiung Weng and Barbara Chapman
Computer Science Department
University of Houston Houston, Texas 77024-3010
thweng, [email protected]
Abstract
In this paper, we present the compiler transformation of OpenMP code to an ordered collection of tasks, and the compile-time as well as runtime mapping of the resulting task graph to threads for data reuse. The ordering of tasks is relaxed where possible so that the code may be executed in a more loosely synchronous fashion. Our current implementation uses a runtime system that permits tasks to begin execution as soon as their predecessors have completed. A comparison of the performance of two example programs in their original OpenMP form and in the code form resulting from our translation is encouraging.
1 INTRODUCTION
OpenMP [1] succeeds as a popular parallel programming interface for medium scale high performance applications on shared memory systems due to its portability, ease-of-use and support for incremental parallelization. Writing OpenMP programs is relatively easy, in contrast to other parallel programming models such as MPI. Users may achieve reasonable performance by adding just a few OpenMP directives. However, it requires a non-trivial effort to obtain scalable OpenMP codes for ccNUMA systems. In order to make OpenMP a parallel programming model for a system consisting of a large number of processors, synchronization in the code must be minimized. An additional potential cause of performance degradation on ccNUMA (cache coherent Non-Uniform Memory Access) systems is that a thread may need to access a remote memory location; despite substantial progress in interconnection systems, this still incurs significantly higher latency than accesses to local memory, and the cost may be exacerbated by network contention. If data and threads are unfavorably mapped, cache lines may ping-pong between locations. However, OpenMP provides few features for managing data locality. In particular, it does not enable the explicit binding of parallel loop iterations to a processor executing a given thread; such a binding might allow the distribution of work such that threads can reuse data.
An alternative approach that can provide high performance on ccNUMA platforms is for the user to rewrite the OpenMP code, decomposing the data structures into portions that will be private to each thread, creating buffers to hold shared data and arranging for the appropriate synchronization. Yet this so-called Single Program Multiple Data (SPMD) OpenMP style requires considerable reprogramming. We attempt to relieve users from the extra burden of creating SPMD code and to help them achieve good performance for their original programs by transforming standard OpenMP codes to tasks, relaxing synchronization between tasks where possible, and mapping them to processors for data reuse.
2 OUR APPROACH
We target OpenMP to a data flow execution model realized using the SMARTS runtime system [7][8] developed at Los Alamos National Laboratory. At runtime, SMARTS employs a master thread to orchestrate the work of slave threads that execute the tasks. The tasks, a graph that represents the ordering of their execution, and the data read and written by each task are passed to the runtime system. Before a task can run, it must make read/write requests for the data it needs. As soon as the data are available, the master thread schedules the task to a slave queue for execution. The slave thread may be chosen by the compiler or by the runtime system. SMARTS also supports work stealing, where a slave thread takes one or more tasks from other queues.
The compiler transforms an OpenMP code to a collection of sequential tasks and a task graph, which represents the dependences between the tasks. This translation is based heavily on array region analysis, which must be applied interprocedurally. One of our goals is to perform as much work as possible at compile time in order to avoid expensive runtime computation of dependences; where necessary, code will be generated to complete the task graph and finalize the mapping to processors at run time. The performance of the resulting code will be determined by the suitability of the code for this approach, the quality of the translation, and the amount of data reuse by each processor, which must compensate for the overheads incurred by the SMARTS runtime system. These overheads include managing requests for read and write data, as well as the scheduling of tasks into task queues by SMARTS.
The compiler may specify the mapping of tasks to processors (or equivalently, to threads that are bound to processors) or may leave this to the runtime system. Finding a good compile-time mapping of nodes of the task graph to the processors for data locality reuse is the main focus of this paper. We use this mapping when we generate parallel code that includes calls to SMARTS. The resulting code is executed using the so-called first touch policy, where data will be stored so that it is local to the task that first accesses it. All tasks that use this data will be assigned to the same thread wherever possible to maximize data locality. We describe and discuss the mapping algorithm that we have developed below.
This paper is organized as follows. In the next section, we first discuss the generation of tasks and the task graph. We then present the algorithm for scheduling the tasks in the graph to threads for data locality reuse. Our strategy for code generation is also discussed. Then, in Section 4, we illustrate our work using an LU kernel. We also use an ADI code example to illustrate how our mapping algorithm enables macro-pipelined computations. Then, experimental results are presented. Finally, Section 5 presents related work and Section 6 our conclusions and future work.
3 CREATION AND MAPPING OF THE TASK GRAPH
For the purposes of this paper, we call a unit of sequential code that will be executed by a single thread a task. Such a task is an arbitrary code sequence, such as a set of iterations of a loop nest, or one or more procedures, without synchronization constructs. An OpenMP program will be decomposed into an ordered collection of tasks according to the semantics of the language. The ordering will be represented by the task graph. A task graph for an OpenMP program, denoted by G(N,E), consists of a set of nodes N = {t1, t2, ..., tm}, where each node represents a task in the decomposed program, and a set of edges E between nodes, where ei,j is an edge from node ti to node tj. Each edge represents synchronization in the sense that the source node must be executed before the sink node at run time. As part of our translation strategy, we will eliminate as many edges as possible from this graph in order to reduce the amount of synchronization imposed on the executing code. We will also use this graph and our analysis to map the tasks in the program to threads. Note that we bind exactly one thread to each target processor, so that we may refer to threads and processors interchangeably. We assume that processors (and threads) are numbered consecutively, beginning with 1.
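As a concrete illustration of this representation (a hypothetical sketch, not the authors' data structures), the task graph G(N,E) can be held as a node table plus a set of directed edges; the example below instantiates the tasks derived from the code of Figure 2.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One unit of sequential code executed by a single thread.
    Read/write accesses are kept as sets of array-section descriptors,
    here simplified to hashable tuples (name, lo1, up1, lo2, up2)."""
    tid: int
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)

@dataclass
class TaskGraph:
    """G(N, E): nodes are tasks; edge (i, j) means task i must
    complete before task j may start."""
    nodes: dict = field(default_factory=dict)   # tid -> Task
    edges: set = field(default_factory=set)     # {(i, j), ...}

    def add_task(self, task):
        self.nodes[task.tid] = task

    def add_edge(self, i, j):
        self.edges.add((i, j))

    def predecessors(self, j):
        return [i for (i, k) in self.edges if k == j]

# Tasks t9 and t10 (the two halves of the parallel loop of Figure 2)
# and t11 (the SINGLE construct), with their write sections:
g = TaskGraph()
g.add_task(Task(9, writes={("A", 1, 100, 1, 50)}))
g.add_task(Task(10, writes={("A", 1, 100, 51, 100)}))
g.add_task(Task(11, writes={("A", 1, 1, 1, 1)}))
g.add_edge(9, 11)
g.add_edge(10, 11)
```

Any richer per-task payload (code segment, loop bounds, synchronization constraints) would hang off the same node records.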
3.1 Generation of tasks and task graph
OpenMP does not restrict work-sharing constructs to be within the lexical scope of the parallel region; they can occur within a subroutine or a procedure that is invoked, either directly or indirectly from inside a parallel region. Parallel regions may contain long call chains. Thus the work required to create tasks is of necessity interprocedural. We have developed an efficient call graph construction algorithm [34] to support recursive calls and dynamically bound calls arising from invocations of procedure-valued formal parameters. It provides a precise traversal of the generated call graph, since additional traversal information is saved. It has been implemented in our Dragon tool [36] that is based on Open64 [35]. The traversal algorithm is used to create task identification and gather array section information in an exact call chain of the program call graph. For each task, we collect the following information: 1) read/write access to array sections; 2) the code segment; 3) information on (possible) loop iterations; 4) synchronization constraints.
We follow the semantics of OpenMP worksharing constructs (DO, SECTIONS, MASTER, and SINGLE, etc.) to generate the sequential tasks. For example, the iterations of a parallel loop with a default schedule are assigned to threads so that each has a contiguous set of iterations of approximately the same size. If the loop does not contain any further worksharing or synchronization constructs, each such set of iterations would be a task in our program representation. A SINGLE or MASTER construct each represents a task, while a CRITICAL SECTION will be translated to one task per thread, each with the same code segment. Sequential regions of a code may be divided into basic blocks, each of which may form a task. The complete sequence of tasks of a program is generated interprocedurally in a traversal of our call graph. A typical OpenMP program will contain barrier synchronizations, and possibly atomic updates or critical sections to ensure ordered access to shared data. These synchronizations will be represented explicitly in the task graph that is also constructed. However, many of the resulting edges are in fact superfluous in this representation. For instance, an OpenMP code may require a barrier between two parallel loop nests, although the data reuse between them only occurs between two of the resulting tasks. All other tasks can be executed asynchronously.
In order to determine which edges can be eliminated from the graph, the compiler performs an array section analysis to determine as accurately as possible each task’s write and read accesses to shared data objects in the program. If we can use this analysis to prove that a pair of tasks that are connected by an edge do not access the same data (or such accesses are reads only), then we may remove the corresponding edge from the task graph.
We have chosen to use standard regular sections, or the triplet notation, to represent the data read and written, because of the efficiency with which we can compare them ([15], [9]); however, more accurate analyses have been proposed, e.g., simple sections [10], regions [11][12][13], and the Omega test [17], and these are potential alternatives for compile-time analysis when precision is critical. Efficiency and practicality are key for runtime synchronization analysis, and regular sections are the natural choice for our runtime strategy, where we must avoid overheads.
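A minimal sketch of the regular-section comparison (names and representation are ours): each array dimension is described by lower and upper bounds, and two sections are disjoint as soon as any dimension fails to overlap. Ignoring strides, as below, is conservative; it may report overlap where a stride-aware test would prove disjointness.

```python
def ranges_overlap(lo1, up1, lo2, up2):
    """True if the integer ranges [lo1, up1] and [lo2, up2] intersect."""
    return max(lo1, lo2) <= min(up1, up2)

def sections_overlap(sec1, sec2):
    """sec1, sec2: lists of (lower, upper) bounds, one pair per array
    dimension. Sections are disjoint as soon as one dimension is."""
    return all(ranges_overlap(l1, u1, l2, u2)
               for (l1, u1), (l2, u2) in zip(sec1, sec2))

# A[1:100, 1:50] vs A[1:100, 51:100] (tasks t9 and t10): disjoint columns.
assert not sections_overlap([(1, 100), (1, 50)], [(1, 100), (51, 100)])
# A[1:100, 1:50] vs A[1:1, 1:1] (task t11 writes A(1,1)): overlap.
assert sections_overlap([(1, 100), (1, 50)], [(1, 1), (1, 1)])
```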
In order to improve the accuracy of our analysis, we use a delayed union approach when determining the region of an array that is accessed by a thread. Under this approach, two referenced sections of an array are merged if the merge does not lead to a loss of accuracy. Otherwise, each individual section is kept in a list. If a new reference cannot be merged with an existing regular section, it is added to the list until a threshold number of entries is achieved, in which case union will be enforced. This method may not improve accuracy in many cases.
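The delayed-union policy can be sketched in one dimension as follows (our illustration; the lossless-merge criterion and the threshold value are simplifications of what a real implementation would use):

```python
def exact_union(a, b):
    """Merge two 1-D sections (lo, up) only if their union is still a
    single contiguous range, i.e. the merge loses no accuracy.
    Returns the merged section, or None if merging would be lossy."""
    (lo1, up1), (lo2, up2) = a, b
    if max(lo1, lo2) <= min(up1, up2) + 1:   # overlapping or adjacent
        return (min(lo1, lo2), max(up1, up2))
    return None

def add_reference(sections, ref, threshold=4):
    """Delayed union: try a lossless merge first; otherwise append the
    new section to the list. Past the threshold, force a (lossy)
    bounding union of everything."""
    for i, sec in enumerate(sections):
        merged = exact_union(sec, ref)
        if merged is not None:
            sections[i] = merged
            return sections
    sections.append(ref)
    if len(sections) > threshold:
        lo = min(s[0] for s in sections)
        up = max(s[1] for s in sections)
        sections = [(lo, up)]
    return sections

secs = add_reference([(1, 50)], (51, 100))   # adjacent: merges losslessly
assert secs == [(1, 100)]
secs = add_reference([(1, 50)], (200, 300))  # disjoint: kept separately
assert secs == [(1, 50), (200, 300)]
```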
The need for synchronization is determined by comparing the data written by a task with the data read and written by a task to which it is linked via an edge in the task graph. Where compile-time analysis can prove that there is no common data, the edge is deleted from the graph. Where results are imprecise, runtime tests may be inserted to determine the need for an edge during program execution. It is worthwhile to do so when the test is fast, such as checking the range of a specific constant value.
When the number of threads is known in advance, task generation is straightforward. Otherwise, more work is needed at runtime. In the former case, each task is created, ordering relationships are determined based upon OpenMP semantics, and the data read and written by each task are computed. Task sequences and their read/write accesses to regular sections of arrays are also obtained. For example, we present the algorithm for translating the DO construct in Figure 1.
To illustrate the above algorithm, we use the code in Figure 2 and assume that NT is 2. Tasks will be created by subdividing the work in the outer loop according to the default schedule. If eight tasks have already been created when this loop is encountered, then it will generate two new tasks, t9 and t10, to perform the work in the loop. The first of these tasks will perform the first 50 iterations and the second will perform the remaining 50. Accordingly, task t9 has bounds I=1,50 and J=1,100 and will write the array section A[1:100, 1:50]. Task t10 has bounds I=51,100 and J=1,100 and writes to A[1:100, 51:100]. These tasks must both wait until all tasks involved in the previous construct have been completed, unless the NOWAIT clause was specified there.
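The default-schedule split that produces t9 and t10 can be sketched as follows (a hypothetical helper; OpenMP implementations may differ in how they distribute a remainder):

```python
def default_schedule_bounds(n_iters, n_threads):
    """Contiguous, approximately equal chunks of the iteration space
    1..n_iters, one per thread, as under the OpenMP default (static)
    schedule. Returns a list of (lower, upper) bounds."""
    base, extra = divmod(n_iters, n_threads)
    bounds, lo = [], 1
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)  # spread the remainder
        bounds.append((lo, lo + size - 1))
        lo += size
    return bounds

# The loop of Figure 2: 100 iterations, NT = 2 -> tasks t9 and t10.
assert default_schedule_bounds(100, 2) == [(1, 50), (51, 100)]
```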
Each of the SINGLE, MASTER and SECTION constructs corresponds to one task. So for the code in Figure 2, task t11 will be created to execute the code within the SINGLE construct that has write access to array section A[1:1,1:1]. Both t9 and t10 will be predecessors of t11. By applying
Figure 1 The generation of tasks from an OpenMP parallel loop

Let NT be the number of threads and tk the last task before this loop
DO I = 1, NT
    k = k + 1
    Compute bounds of tk according to loop schedule
    Compute array accesses by task tk
    Add tk to N
ENDDO
Figure 2 Example of DO

!$OMP PARALLEL
…
!$OMP DO
DO I = 1, 100
    DO J = 1, 100
        A(J,I) =
    ENDDO
ENDDO
!$OMP SINGLE
A(1,1) =
!$OMP END SINGLE
!$OMP END PARALLEL
For each ei,j in E do
    If (ti.RWaccess ∩ tj.RWaccess) = φ then
        Remove ei,j from E
        For each tk ∈ pr(ti) do
            If (tk.RWaccess ∩ tj.RWaccess) ≠ φ then
                Add ek,j to E
                Compute ek,j.reuseLevel
            Endif
        Endfor
    Else
        Compute ei,j.reuseLevel
    Endif
Endfor

Figure 3 Algorithm for eliminating edges from task graph
the algorithm of Figure 3, we can remove edge e10,11.
Let there be an edge from ti to tj in the task graph, and let Ri be the data read by task ti and Wi be the data it writes. As already described, we must compare the data written in each with the data read and written in the other in order to determine whether we can remove the edge. If we can do so, we must compare the data accessed by task tj with that accessed by all predecessors of task ti in the graph. For the purpose of scheduling, we are also interested in situations where two tasks read the same data. This information will also be recorded in the form of a special edge.
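Rendered as executable code, the edge-elimination step of Figure 3 might look like the following sketch (the data structures and the element-counting reuseLevel are our simplifications; the compiler compares regular sections rather than explicit element sets):

```python
def eliminate_edges(rw, edges, reuse=None):
    """rw: task id -> set of accessed data items (RWaccess).
    edges: set of (i, j) ordering edges, mutated in place.
    Removes edges whose endpoints share no data; when edge (i, j) is
    removed, edges are added from predecessors of i that do share data
    with j, so no required ordering is lost. Surviving and newly added
    edges get a reuse level (here: the count of shared items)."""
    if reuse is None:
        reuse = {}
    for (i, j) in sorted(edges):          # snapshot before mutation
        if not rw[i] & rw[j]:
            edges.discard((i, j))
            for k in [a for (a, b) in edges if b == i]:
                if rw[k] & rw[j]:
                    edges.add((k, j))
                    reuse[(k, j)] = len(rw[k] & rw[j])
        else:
            reuse[(i, j)] = len(rw[i] & rw[j])
    return edges, reuse

# t9 writes A(1,1) among other elements; t10 does not touch A(1,1);
# t11 writes only A(1,1). Edge e10,11 is removed, as in the text.
rw = {9: {"A11", "Aleft"}, 10: {"Aright"}, 11: {"A11"}}
edges, reuse = eliminate_edges(rw, {(9, 11), (10, 11)})
assert edges == {(9, 11)}
```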
If there is an edge between nodes ti and tj in the modified task graph, the compiler must further determine the level of data reuse between the two tasks; the result will be appended to the edge and denoted by ei,j.reuseLevel. The reuse level is determined by the amount of data accessed by both the source and sink tasks, and by the type of dependence between them. For example, if both tasks write the data, the reuse level will be higher than if both read the same data. The execution time of each task ti in the task graph, w(ti), is statically estimated from the number and type of operations and operands, available information on loop bounds, etc. An alternative would be to provide a profile option. We may also be able to provide better estimates at run time, when symbolic bounds etc. can be evaluated.
Note that tasks may be subdivided to improve their cache behavior as well as to isolate iterations that give rise to synchronization, and that reordering of computation within a task may support the former but not the latter. Hence, techniques to determine appropriate granularity and to move constructs causing synchronization are required. This topic is not discussed in this paper.
3.2 Mapping algorithm for data locality reuse
We use a heuristic algorithm to map tasks in the task graph to processors (Figure 4). It is based upon a list scheduling approach [33]. The goal of the algorithm is to map tasks that use the same data to the same processor. However, at each step of the way, it must decide which one is the next task to map: it maps “more important” tasks first when possible, where this is determined by the weight of a node.
The algorithm consists of an initialization and an iterative step. In the initialization step, the algorithm first computes the level of each task ti by determining the sum of the weights on paths from ti to the terminal node, taking both edge and task weights into account. It sets level(tj) to the greatest weight on such a path. It also computes #ipr(tj), the number of immediate predecessors of tj. The function Procs(ti) = pk records mapping decisions, i.e., task ti is mapped to processor pk; initially, it is set to NULL.
At any stage during the process, the ready queue will hold those tasks that are ready for mapping but not yet assigned to processors. The algorithm initially puts all tasks without any predecessors in the task graph (“initial tasks”) into the ready queue (RQ) and orders them based upon their level; then it selects the highest priority (highest level) task tcurrent from RQ as the next ready task, assigns it to processor 1, and removes tcurrent from RQ.
The iterative step successively selects tasks and allocates them. In step 2.1, the algorithm will select the most important of the immediate successors of tcurrent by examining those for which the data reuse between it and tcurrent is highest. If this task is subsequently chosen for mapping, it will be allocated to the same processor as tcurrent. It uses the two functions list_max_isuccs() and max_ipred() to do so. The function list_max_isuccs() will return a list of those immediate successors of tcurrent for which the highest reuse level is to be found on the edge from tcurrent. In other words, suppose ts is one of tcurrent’s immediate successors with two incoming edges, from tcurrent and from some other task tk. If ek,s.reuseLevel > ecurrent,s.reuseLevel, then ts will not be included in the list returned by this function. The function max_ipred() selects the task from those returned by list_max_isuccs() that corresponds to the edge from tcurrent with the highest reuse level. The chosen task, ttemp, will be temporarily mapped to the processor that executes tcurrent, and its incoming edge from tcurrent is marked as visited. The visited edges only affect list_max_isuccs() and max_ipred().
We next compare the priority (weight) of the chosen successor task with that of the highest priority task in the ready queue. Whichever one has the highest priority will be mapped next; we call it tnext. Before continuing, we now add to RQ any tasks all of whose predecessors have already either been assigned to processors or placed in RQ. In step 4, we remove the next task from RQ, if necessary, and fix its mapping. If it is an immediate successor of tcurrent, then it will be mapped to the same processor. Otherwise, we call the function Locate_Processor to check whether it has been temporarily mapped to a processor; if it has, this function will map it to that processor. Otherwise, the function will find an idle processor for the mapping. Finally, we set tnext to be the new tcurrent before beginning the next iteration.
Input: Task graph G(N,E) and m processors of a distributed shared memory system
Output: procs(ti) = pk for each task ti, the assignment of tasks to processors
Begin
1. Initialization step:
   1.1 For each task ti in G do
       1.1.1 Compute level(ti) and #ipr(ti)
       1.1.2 Initialize procs(ti) to NULL
   1.2 Add all initial tasks to RQ (ready queue), ordered by level number
   1.3 Get the highest-priority task in RQ, set it to tcurrent, and remove it from RQ
       procs(tcurrent) = p1
Repeat
2. Select next task for mapping:
   2.1 Select the best immediate successor of tcurrent:
       If tcurrent has an immediate successor then
           ttemp = max_ipred(list_max_isuccs(tcurrent))
           Temporarily map ttemp to procs(tcurrent)
           mark_visited(inedges(ttemp))
       else
           ttemp = NULL
       Endif
   2.2 Compare priority with tasks in RQ and select tnext:
       tnext = maxweight(ttemp, RQ)   /* task with maximum priority among ttemp and RQ */
3. Process successors of tcurrent:
   3.1 For each ti ∈ isuccs(tcurrent) do
           #ipr(ti) = #ipr(ti) - 1
           If #ipr(ti) == 0 then
               Add ti to RQ in priority order
           Endif
       Endfor
4. Map next task:
   4.1 If tnext is in RQ then
           Remove tnext from RQ
           If tnext is an immediate successor of tcurrent then
               procs(tnext) = procs(tcurrent)
           else
               Locate_processor(tnext, PL); procs(tnext) = procs(PL)
           Endif
           If not visited(inedge(tnext)) then
               mark_visited(inedge(tnext))
           Endif
       Endif
5. Update current task: tcurrent = tnext
Until all tasks in G are assigned
End
Figure 4 Data reuse mapping algorithm
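As a concrete illustration, the algorithm of Figure 4 can be rendered in executable form. The sketch below is our own rendering, not the compiler's implementation: the node and edge weights are read from the Figure 7 example, and level(t) is assumed to be the weight of the heaviest reuse path from t to a leaf, an assumption that reproduces the levels 81, 101, 1, 1, 1 quoted in Section 4.

```python
import heapq

# Node and edge weights from the Figure 7 example (our reading of the figure).
WEIGHTS = {'t1': 1, 't2': 1, 't3': 1, 't4': 1, 't5': 1}
EDGES = {('t1', 't3'): 70, ('t1', 't4'): 80, ('t1', 't5'): 20,
         ('t2', 't4'): 100, ('t2', 't5'): 20}

SUCC = {t: [v for (u, v) in EDGES if u == t] for t in WEIGHTS}
PRED = {t: [u for (u, v) in EDGES if v == t] for t in WEIGHTS}

def level(t):
    # Assumption: a task's level is the heaviest reuse path to a leaf;
    # this reproduces the levels 81, 101, 1, 1, 1 quoted in the text.
    if not SUCC[t]:
        return WEIGHTS[t]
    return max(EDGES[(t, s)] + level(s) for s in SUCC[t])

def map_tasks():
    procs, temp, visited = {}, {}, set()
    ipr = {t: len(PRED[t]) for t in WEIGHTS}
    rq = [(-level(t), t) for t in WEIGHTS if ipr[t] == 0]   # ready queue
    heapq.heapify(rq)
    _, tcur = heapq.heappop(rq)
    procs[tcur], next_idle = 0, 1
    while len(procs) < len(WEIGHTS):
        # Step 2.1: best immediate successor of tcur, among those whose
        # highest-reuse unvisited incoming edge comes from tcur.
        cands = [s for s in SUCC[tcur]
                 if s not in procs and (tcur, s) not in visited
                 and EDGES[(tcur, s)] == max(EDGES[(p, s)] for p in PRED[s]
                                             if (p, s) not in visited)]
        ttemp = max(cands, key=lambda s: EDGES[(tcur, s)], default=None)
        if ttemp is not None:
            temp[ttemp] = procs[tcur]                        # temporary mapping
            visited.update((p, ttemp) for p in PRED[ttemp])
        # Step 2.2: the higher-priority of ttemp and the head of the ready queue.
        top = rq[0][1] if rq else None
        if ttemp is not None and (top is None or level(ttemp) >= level(top)):
            tnext = ttemp
        else:
            tnext = top
        # Step 3: release successors whose predecessors have all been processed.
        for s in SUCC[tcur]:
            ipr[s] -= 1
            if ipr[s] == 0:
                heapq.heappush(rq, (-level(s), s))
        # Step 4: fix the mapping of tnext.
        if any(t == tnext for _, t in rq):
            rq = [(w, t) for (w, t) in rq if t != tnext]
            heapq.heapify(rq)
            if tnext in SUCC[tcur]:
                procs[tnext] = procs[tcur]
            else:
                procs[tnext] = temp.get(tnext, next_idle)    # Locate_processor
                next_idle += procs[tnext] == next_idle
        else:
            procs[tnext] = procs[tcur]                       # ttemp stays with tcur
        tcur = tnext
    return procs

print(map_tasks())   # {'t2': 0, 't1': 1, 't3': 1, 't4': 0, 't5': 2}
```

On the Figure 7 graph this reproduces the mapping derived in Section 4: t2 and t4 on p1 (index 0), t1 and t3 on p2, and t5 on p3.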
3.3 Code Generation
3.3.1 Static case
In the static case, the compiler collects as much information as possible in order to avoid runtime overheads and to generate code that passes complete information to the runtime system. Such information includes the task identification, possibly the loop iterations, the task mapping, and read/write access information (or the task graph itself, if another runtime system is used). This is necessary so that tasks can be executed on the processors chosen for data reuse, and so that independent tasks can be executed out of order.
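For illustration only, the per-task record described above might take the following shape. The field names and layout are our assumptions, not the actual compiler/runtime interface; the access sections are those of task t11 from the LU example of Section 4.1.

```python
from dataclasses import dataclass, field

# Hypothetical shape of the per-task record described above; the field names
# are illustrative assumptions, not the actual compiler/runtime interface.
@dataclass
class TaskInfo:
    task_id: str                                   # task identification
    iterations: tuple = None                       # loop iterations, if a DO chunk
    processor: int = 0                             # compile-time mapping decision
    reads: list = field(default_factory=list)      # array sections read
    writes: list = field(default_factory=list)     # array sections written

# e.g. task t11 of the LU example: chunk j = 2 of the k = 1 DO construct
t11 = TaskInfo('task_loop0', iterations=(2, 2), processor=0,
               reads=[('A', (2, 5), (2, 2)), ('A', (2, 5), (1, 1)),
                      ('A', (1, 1), (2, 2))],
               writes=[('A', (2, 5), (2, 2))])
print(t11.task_id, t11.processor)   # task_loop0 0
```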
After compile-time task mapping has been performed, we can generate source code such as that in Figure 5, derived from the example of Figure 2. The main program is first executed by the master thread, which creates additional threads, each of which is bound to a processor and passed the information stated above. In Figure 6, task_loop0 and task_single are the routines for the tasks corresponding to the DO construct and the SINGLE construct of Figure 2, respectively.

call SM_concurrency(n)        ! create n-1 slave threads to execute tasks
call SM_define_array(info)    ! SMARTS initialization
schedtype = hardaffinity
call SM_startgen()
do i = 1, iter
  k = k + 1
  call SM_def_readwrite(task_loop0, k)
  call SM_getinfo(task_loop0, affin, datinfo)
  call SM_handoff(task_loop0, schedtype, procs(task_loop0,k), datinfo)
end do
k = k + 1
call SM_def_readwrite(task_single, k)
call SM_handoff(task_single, schedtype, procs(task_single,k), datinfo)
call SM_run()
call SM_end()

Figure 5 Main program using SMARTS

subroutine task_loop0(datinfo)
  use shared_DATA
  call SM_compute_bound(datinfo, lb1, ub1, lb2, ub2)
  do j = lb1, ub1
    do i = lb2, ub2
      A(i,j) = ...
    enddo
  enddo
end subroutine task_loop0

subroutine task_single(datinfo)
  use shared_DATA
  A(1,1) = ...
end subroutine task_single

Figure 6 Parallel loops using SMARTS

3.3.2 Dynamic case

Most application programs have a significant number of input-data-dependent unknowns, and the number of processors may also be unknown at compile time. Compile-time analyses can consequently produce symbolic expressions, despite the application of interprocedural constant propagation and other techniques for maximizing the available information about variables. Therefore, symbolic analysis is required to support compile-time analyses as well as to reduce the cost of runtime analysis. Several techniques have been proposed for evaluating symbolic expressions and for expressing information about program variables at arbitrary program points, as well as for aggressively simplifying systems of symbolic constraints, determining the relationships between symbolic expressions, and computing their lower and upper bounds.
Blume and Eigenmann [27] use abstract interpretation to extract information about variable ranges. Haghighat [25] presents a variety of symbolic analysis techniques based on abstract interpretation. Fahringer [26] presents a unified symbolic evaluation framework that statically determines the values of variables and symbolic expressions. He uses the Omega library [18] for simplifying first-order logic expressions and constraints.
We are working on the generation of a routine that computes, at run time, those array sections whose symbolic bounds could not be resolved at compile time. (At compile time, the symbolic analysis consists of aggressively simplifying the system of symbolic constraints, determining the relationships between symbolic expressions, and computing their lower and upper bounds.) In this case we may need to (partially) construct the task graph and perform the task mapping algorithm at run time.
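As a minimal sketch of one ingredient of such a routine, assume an expression has already been reduced to affine form c0 + Σ ci·vi (our simplification; real symbolic analysis handles far more general constraints). Its lower and upper bounds then follow directly from the variable ranges:

```python
# Sketch (not the paper's implementation): bounds of an affine expression
# c0 + sum(ci * vi) given a lower/upper range for each variable vi.
def affine_bounds(c0, terms, ranges):
    """terms: {var: coefficient}; ranges: {var: (lo, hi)}."""
    lo = hi = c0
    for v, c in terms.items():
        vlo, vhi = ranges[v]
        # A positive coefficient is minimized at the variable's lower bound,
        # a negative coefficient at its upper bound (and vice versa for hi).
        lo += c * (vlo if c >= 0 else vhi)
        hi += c * (vhi if c >= 0 else vlo)
    return lo, hi

# e.g. a section bound n - k + 1 with 1 <= k <= 100 and 1 <= n <= 100
print(affine_bounds(1, {'n': 1, 'k': -1}, {'n': (1, 100), 'k': (1, 100)}))
# → (-98, 100)
```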
4 EVALUATION AND EXPERIMENTAL RESULTS
To illustrate how the mapping algorithm of Figure 4 works, we present a simple example task graph in Figure 7. After step 1.1, the levels of tasks t1, t2, t3, t4, and t5 are 81, 101, 1, 1, and 1, respectively, and the numbers of their immediate predecessors (#ipr) are 0, 0, 1, 2, and 2, respectively. In step 1.2, the two independent tasks t1 and t2 are put into RQ, ordered by node level (priority). In step 1.3, task t2 is selected and removed from RQ as the ready task tcurrent, and is assigned to processor p1.
In step 2.1, t2 (as tcurrent) is passed as the input parameter to list_max_isuccs(); t2 has two immediate successors, t4 and t5. Task t4 has two incoming edges, e1,4 and e2,4, with weights of 80 and 100, respectively. Since e2,4.reuseLevel is the greater, e2,4 is in the list. Similarly, t5 has two incoming edges, e1,5 and e2,5, with equal weights, so e2,5 is also in the list. Thus list_max_isuccs(t2) returns (e2,4), (e2,5) as input to max_ipred(), and since e2,4.reuseLevel has the maximum value, it returns t4 as ttemp, which is temporarily mapped to p1. It then marks e2,4 and e1,4 as visited, indicated by the dotted lines in Figure 8; this information is needed for the internal workings of list_max_isuccs() and max_ipred(). In step 2.2, t4 is compared with task t1, which has the highest priority in RQ, and t1 is assigned to tnext. In step 3, #ipr(t4) and #ipr(t5) are decremented to 1. In step 4, t1 is removed from RQ. Task t1 is not an immediate successor of t2 and has not been temporarily mapped to any processor, so Locate_processor() returns an idle processor, p2. Next, t1 becomes tcurrent.
Figure 7 Example task graph

Figure 8 After marking edges as visited

Figure 9 Mapping of Fig. 7
In the next iteration of step 2.1, the best successor of t1 is t3, so ttemp is t3, which is temporarily mapped to p2, and e1,3 is marked visited. In step 2.2, t3 is compared with the maximum-priority task in RQ; since RQ is empty, t3 is assigned to tnext. Now in step 3, #ipr(t3), #ipr(t4), and #ipr(t5) are decremented to 0, so they are put into RQ. In step 4, since t3 is in RQ, it is removed. Because t3 is an immediate successor of t1, it is mapped to the processor that executes t1, which is p2. Then t3 becomes tcurrent.
In the next iteration of step 2.1, t3 does not have a successor, so ttemp is NULL. In step 2.2, tnext is t4. In step 4, t4 is in RQ and is thus removed; since t4 has been temporarily mapped, Locate_processor() maps it to p1. Then t4 becomes tcurrent. Finally, in the last iteration, t4 does not have a successor, so tnext is t5. In step 4.1, t5 is removed from RQ and mapped to p3 by Locate_processor(), since it has not been temporarily mapped. The result is shown in Figure 9.
In the following, we discuss two small codes, one an LU decomposition and the other an ADI iteration, that have been translated using our approach.
4.1 LU example code demonstrating data reuse
We present the OpenMP LU example in Figure 10 to illustrate the generation of tasks, the task graph, and the data locality reuse scheduling algorithm. A task tk is generated for each k from 1 to 5, corresponding to the SINGLE construct. When k is 1, the memory access pattern of task t1 involves a write access to section [2:5,1:1] of array A and a read reference to the regular section [1:5,1:1] of A, obtained as [2:5,1:1] ∪ [1:1,1:1]. Similarly, task t2 involves a write access to A[3:5,2:2] and a read access to [2:5,2:2]. All array sections in the example can be exactly represented when we further decompose a block of data into several columns. This requires a delayed union operation.
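The delayed union operation can be sketched as follows (a simplified model of our own, not the compiler's actual data structure): a section is kept as a list of rectangular regions, and two regions are combined only when their union is again exactly one rectangle.

```python
# Sketch of a "delayed union" over regular array sections. A section is a list
# of disjoint rectangles ((lo1, hi1), (lo2, hi2)); unions stay as lists unless
# two rectangles merge exactly into one rectangle (simplified: no re-merging
# cascade after a merge).

def mergeable(a, b):
    # Two rectangles merge exactly iff they agree in one dimension and are
    # adjacent or overlapping in the other.
    (a1, a2), (b1, b2) = a, b
    if a2 == b2 and b1[0] <= a1[1] + 1 and a1[0] <= b1[1] + 1:
        return ((min(a1[0], b1[0]), max(a1[1], b1[1])), a2)
    if a1 == b1 and b2[0] <= a2[1] + 1 and a2[0] <= b2[1] + 1:
        return (a1, (min(a2[0], b2[0]), max(a2[1], b2[1])))
    return None

def delayed_union(sections):
    result = []
    for s in sections:
        for i, r in enumerate(result):
            m = mergeable(r, s)
            if m is not None:
                result[i] = m
                break
        else:
            result.append(s)      # no exact merge found: keep the union delayed
    return result

# Access of t12: [2:5,3:3] ∪ [2:5,1:1] ∪ [1:1,3:3]
secs = [((2, 5), (3, 3)), ((2, 5), (1, 1)), ((1, 1), (3, 3))]
print(delayed_union(secs))   # [((1, 5), (3, 3)), ((2, 5), (1, 1))]
```

With this rule, the t12 access from the text is summarized to [1:5,3:3] ∪ [2:5,1:1], while a lossy direct union to [1:5,1:3] (which would create a false dependence on t2) is avoided.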
A collection of tasks tkc, where k = 1,…,5 and c is the chunk number, is generated from the OpenMP DO construct in the code of Figure 10. The number of such tasks depends on the loop size, the number of threads, and the schedule chosen; in the present example the schedule is static, the chunk size equals one, and the number of threads is 4. The access of t11 to array A is [2:5,2:2] ∪ [2:5,1:1] ∪ [1:1,2:2]. Our delayed union algorithm does not summarize [2:5,1:2] ∪ [1:1,2:2] into one representation. Similarly, the access to array A by t12 is [2:5,3:3] ∪ [2:5,1:1] ∪ [1:1,3:3], which can be precisely summarized only to [1:5,3:3] ∪ [2:5,1:1]. If t12 were summarized directly to [1:5,1:3], there would be a dependence between t12 and t2, which is a false dependence. Hence, the delayed union operation provides superior accuracy in this case. Figure 11 shows the task graph of the program example in Figure 10, together with the access pattern of each task. The dark shade represents read and write accesses, while the lighter shade represents read-only accesses to array A.

Program LU
  integer n
  parameter (n=5)
  double precision a(n,n)
!$OMP PARALLEL
  do k=1,n
!$OMP SINGLE
    do m=k+1, n
      a(m,k) = a(m,k) / a(k,k)
    enddo
!$OMP END SINGLE
!$OMP DO PRIVATE(i,j)
    do j=k+1,n
      do i=k+1,n
        a(i,j) = a(i,j) - a(i,k)*a(k,j)
      enddo
    enddo
!$OMP ENDDO
  enddo
!$OMP END PARALLEL
end

Figure 10 Simple OpenMP LU code example

Figure 11 Task graph and memory access pattern
In the OpenMP static scheduling shown in Figure 12, we assume that the number of threads is four and that data is placed by the first-touch policy. The data accessed by a processor will then not be the same as that accessed in the next iteration of the k loop. Consider the OpenMP schedule in Figure 12: after t1 and t11 have been scheduled to processor p1, these tasks are the first to touch columns 1 and 2, respectively. Task t12 on processor p2 first touches column 3 of array A, p3 first touches column 4, and p4 first touches column 5. There is no problem with t2, since it reuses column 2. But t22 and t23 cannot reuse the data that has been brought into local memory: on p2, t22 reads column 2 and reads and updates column 4, whereas the earlier task t12 first touched column 3 on that processor. Hence there is no reuse, and the data must be accessed from remote memory.
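The mismatch described above can be checked with a small sketch. The thread and chunk numbering below follow our reading of the example and are assumptions: four threads, static schedule with chunk size 1, columns 2 to 5 of the k = 1 DO construct handed to threads 0 to 3 in order, and the SINGLE task t1 running on thread 0.

```python
# Which thread first-touches each column of A, and which columns a later
# task actually accesses (n = 5, 4 threads, LU example; our assumptions).
n, nthreads = 5, 4

first_touch = {1: 0}                              # column -> owning thread; t1 touches column 1
for c, j in enumerate(range(2, n + 1)):           # t11..t14 touch columns 2..5
    first_touch[j] = c % nthreads

def task_columns(k, c):
    # Task t_kc (chunk c of the k-th DO construct) reads column k and
    # reads/writes column k + c.
    return {k, k + c}

cols = task_columns(2, 2)                         # t22, which runs on thread 1
local = {j for j, t in first_touch.items() if t == 1}
print(sorted(cols), sorted(local), cols & local)  # [2, 4] [3] set()
```

The empty intersection confirms the observation in the text: none of the columns t22 touches are local to the thread that executes it under first-touch placement.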
In addition, there are synchronization overheads in OpenMP; for instance, t2 needs to wait for all of t11, t12, t13, and t14 to finish, so t2 can only be executed after t14. We remove this synchronization overhead by translating the OpenMP code to a task graph; then t2 only needs to wait for t11, not for t12, t13, and t14. Our mapping is scheduled at runtime by the SMARTS runtime system, and the result is shown in Figure 13. It can easily be observed that our approach schedules tasks for data reuse, giving better locality.
4.2 ADI example exploiting macro-pipelined computation for data reuse
of many scientific applications [21][23]. Brandes and Desprez [21] integrated pipelined computation successfully within an HPF compiler called ADAPTOR and reported impressive experimental results. Pipelining has been used by Chamberlain et al. [29] to exploit parallelism in wavefront computations; they introduce extensions to array languages to support pipelined wavefront computation. Gonzalez et al. [24] propose a set of extensions to OpenMP to allow complex pipelined computations. This is done by defining directives in terms of thread groups and precedence relations among tasks originating from work-sharing constructs, as ours are. They assume that a heuristic scheduler is available to achieve this. However, they do not propose any extensions explicitly for macro-loop pipelined computation.

Figure 12 Static scheduling on OpenMP on four processors

Figure 13 Our scheduling based on data locality reuse on four processors
Figure 14 presents a Fortran ADI kernel written in OpenMP. The outer loop of the first parallel loop nest is parallelized so that each thread accesses a contiguous set of columns of the array A. The outer loop of the second nest is also parallelizable, but as a result each thread accesses a contiguous set of rows of A. The static OpenMP standard execution model is shown in Figure 15, and it can easily be seen that it has poor data locality. Even though we can interchange the second loop so that it also accesses A in columns, parallelism is then limited by the overheads of the DO loop and the lack of data locality, and the effects of this can easily be observed. Our strategy for handling this situation is to create a version in which threads access the same data and the second loop is executed in a macro-pipelined manner. The task graph created for this transformed code is shown in Figure 16. When we apply our mapping algorithm to this task graph, it realizes a macro-loop pipelined wavefront computation that enables data locality reuse.
!$OMP PARALLEL
!$OMP DO
  do j = 1, N
    do i = 2, N
      A(i,j) = A(i,j) - B(i,j)*A(i-1,j)
    end do
  end do
!$OMP END DO
!$OMP DO
  do i = 1, N
    do j = 2, N
      A(i,j) = A(i,j) - B(i,j)*A(i,j-1)
    end do
  end do
!$OMP END DO
!$OMP END PARALLEL
Figure 14 ADI Kernel code using OpenMP
Figure 15 OpenMP standard execution of Figure 14

Figure 16 Macro-pipelined computations of Figure 14
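The benefit of the macro-pipelined execution in Figure 16 can be illustrated with a small scheduling sketch of our own (the block and sweep indexing are hypothetical, not taken from the paper): task (s, b), the b-th block of sweep s, can start once block b of the previous sweep and block b-1 of the same sweep have completed.

```python
# Earliest step at which each task of a macro-pipelined wavefront can run,
# given the dependences (s-1, b) -> (s, b) and (s, b-1) -> (s, b).
def wavefront_steps(nsweeps, nblocks):
    step = {}
    for s in range(nsweeps):
        for b in range(nblocks):
            deps = []
            if s > 0:
                deps.append(step[(s - 1, b)])   # same block, previous sweep
            if b > 0:
                deps.append(step[(s, b - 1)])   # previous block, same sweep
            step[(s, b)] = max(deps, default=-1) + 1
    return step

steps = wavefront_steps(2, 4)
print(max(steps.values()) + 1)   # 5 pipeline steps vs. 8 if the sweeps serialize
```

With two sweeps over four blocks, the blocks of the second sweep overlap with those of the first, so the pipeline finishes in nsweeps + nblocks - 1 = 5 steps instead of the 8 steps of fully serialized sweeps.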
4.3 Experimental results
We compare OpenMP versions of the LU kernel and the ADI kernel with the code translated according to our scheme, using SMARTS as the runtime system. Our target platform was an Origin 2000 at the National Center for Supercomputing Applications (NCSA). All codes were compiled with SGI's MIPSpro Fortran 90 compiler under the options -64 -Ofast -IPA and run on MIPS R10000 processors at 195 MHz, with 32 Kbytes of split L1 cache and 4 Mbytes of unified L2 cache per processor, and 4 Gbytes of DRAM memory. We made use of the first-touch allocation strategy provided by SGI, where a datum is stored on the node where it is first accessed. We also set _DSM_MIGRATION (page migration) OFF.
[Figure 17 chart: execution time in seconds of the LU kernel code versus number of processors (1, 2, 4, 8, 16, 32) for the OMP MIPSpro, OMP Reuse, and SMARTS versions.]
We show results for the LU computation using a 1024 by 1024 double-precision matrix in Figure 17. The original OpenMP LU code compiled with MIPSpro is labeled OMP MIPSpro; OMP Reuse is a hand-written OpenMP LU version scheduled for reuse, also compiled with MIPSpro. Here we see that the SMARTS runtime system introduces non-trivial overheads. Its master/slave scheme controls the distribution of work to slave threads, and this coordination incurs runtime overhead. In addition, SMARTS does not accept task graph information in the form of precedence constraints between tasks; instead, it requires information on the read/write accesses tasks make to data objects, using these to (re)compute the data dependences at runtime. Therefore, it incurs significant overheads while recomputing information that is already available. We initially chose SMARTS in order to experiment with various strategies for runtime scheduling. Unfortunately, the high overheads in this case make it hard to show the benefits of our approach. In the future, we plan to experiment with other runtime systems that suit this need. At the same time, we also want to investigate and develop methods to reduce these overheads, possibly by minimizing the number of data objects passed to the runtime system while still decomposing arrays into blocks of a size that avoids false dependences.
In this experiment, A, the only array in the program, is declared as shared. In addition, we avoid false dependences by breaking the work into smaller tasks. In SMARTS terminology, each array region accessed by a task is a data object. Since the SMARTS runtime system needs to handle many small data objects as a result of our work decomposition, the initialization overheads caused by passing information to the runtime system lead to a slightly longer execution time for the translated code on small numbers of processors. Even though the overhead incurred by the runtime system is high in this case, it is amortized by the good data reuse as the problem size and the number of iterations increase.

Figure 17 LU kernel code
[Figure 18 chart: execution time in seconds of the ADI kernel code versus number of processors (1, 2, 4, 8, 16, 32) for the OMP MIPSpro V1, OMP MIPSpro V2, and SMARTS versions.]
Figure 18 illustrates our results for the ADI kernel code. There are two versions of the OpenMP code: version one parallelizes the outer loop of each loop nest as written in Figure 14, whereas version two interchanges the second loop nest and parallelizes it along the i loop. For our measurements, we executed the code with arrays of size 1024 by 1024. Our code, translated to the dataflow execution model to enable macro-pipelined computation, is labeled SMARTS. In the translated code, the number of data objects and read/write requests depends on the decomposition of arrays A and B. Since this code does not require decomposing the arrays into smaller blocks to avoid false dependences, there is no substantial overhead, even with a few processors. With an increasing number of processors, the translated code also gives better performance and scales well up to 32 processors.
5 RELATED WORK
The method provided by OpenMP for enforcing locality is to declare variables to be private to threads. One of the best ways to get high performance is to write code in the SPMD (Single Program Multiple Data) style which involves a systematic privatization of program data [19][20]. However, this may require significant reprogramming, which undermines one of the major benefits of the OpenMP API. The use of OpenMP and MPI [5] together also works, but at the price of considerable additional program complexity. In addition, the program will be more difficult to maintain. Several approaches have been proposed to help the user obtain good performance with modest programming effort, including extending OpenMP by data distribution directives and directives to link parallel loop iterations with the node on which specified data is stored [2][3], first touch memory allocation, dynamic page migration by the operating system [4], user-level page migration [4], and compiler optimization [6].
Figure 18 ADI kernel code

Dynamic page migration has been used in ccNUMA machines to improve data locality. In this scheme, the operating system keeps track of memory page counters to check whether to move a page to a remote node that frequently accesses it. However, this information cannot be associated with the semantics of the program, and if compile-time information is not passed to the operating system, the latter may move a page when it is not helpful to do so, or it may not move a page that should be migrated. User-level page migration is then required to improve this approach. The compiler or performance tools are needed to help identify the regions of the program that are the most computation-intensive so that calls to page migration routines may be inserted accordingly. Although this is likely to lead to an improvement, it still cannot precisely represent the semantics of the program without compile-time analysis of data access patterns. Programming using data distribution directives has the potential to help the compiler ensure data locality; however, it means that programmers must be aware of the pattern in which threads will access data across large program regions. Hence the programming task remains more complex than with plain OpenMP. Moreover, there is no agreed standard for specifying data distributions in OpenMP; existing ccNUMA compilers have implemented their own data distribution directives, sacrificing portability. The most successful existing strategy is also the simplest: first-touch data allocation places data so that it is local to the thread that first allocates it. Our algorithm also uses this technique, provided by SGI, HP and Sun among others, to ensure appropriate initial storage for data. Thus our work can be viewed as attempting to maintain this locality.
We translate OpenMP to tasks that may share data and map these tasks to a cc-NUMA platform. Most researchers who focus on task mapping problems have performed work to handle the mapping of processes on distributed systems [32][30]. Our tasks may be finer grained than these. Moreover, most distributed memory tasks already have a clear communication structure, which helps in the evaluation of alternatives and generally forms the basis for the solution proposed. Very little work has considered the problem of mapping shared memory tasks on distributed shared memory systems, such as cc-NUMA, where there is a lower penalty for making non-local accesses and the code does not make non-local data references explicit. Liang et al. [31] proposed a static task mapping method for handling shared memory programs on a DSM system based on a two-dimensional Hopfield neural network to map a group of related tasks to a group of workstations that provide DSM. However, the neural network method is too costly to apply to dynamic cases. They also have difficulty estimating the cost that is incurred by remote memory accesses and cache misses, since the actual location of shared data depends on the operating system. We use the first touch approach to minimize the last of these problems.
6 CONCLUSION AND FUTURE WORKS
Our goal is to search for compiler techniques that improve the performance of OpenMP codes without the necessity of manual program modification, especially on large ccNUMA systems. Difficulties include the cost of non-local data accesses, which remains significantly higher than that of local references, the high amount of global synchronization in most OpenMP codes, and false sharing of data in cache. To achieve our aims, we translate OpenMP codes to a form that may be executed in a dataflow fashion that enables large-scale out-of-order execution, and we explore the ability of the compiler to reduce synchronization overheads as well as to improve data reuse in the resulting ordered collection of tasks [16]. In this paper, we present a practical algorithm for the generation and improvement of a task graph that represents the necessary synchronization in an OpenMP code. We propose an algorithm for mapping the resulting tasks to processors for data reuse. From this, we can generate code that schedules tasks for reuse at runtime. We also demonstrate that this scheme can enable macro-wavefront pipelined computation. Our approach relies on compiler technology and a suitable runtime system; thus there is no need for user intervention.
Despite using a relatively inefficient runtime system in our experiments, the results with LU and ADI kernels show that our strategy has improved performance. However, there is more to be learned. Our mapping algorithm is sensitive to the method for weighting of nodes and edges, which must represent data transfer and cache effects; we intend to continue to experiment with these parameters. Our current effort does not take the problem of false sharing of cache data into account and this needs to be considered. Our future work also includes exploiting symbolic analysis to
improve our ability to handle more difficult cases and to lower the overheads of dynamic task scheduling.
Acknowledgement
This work has been supported by the Los Alamos Computer Science Institute under grant number LANL 03891-99-23 (DOE W-7405-ENG-36) and by the NSF via the computing resources provided at NCSA. We thank Suvas Vajracharya for introducing us to SMARTS, discussing its functionality with us and encouraging us to use the library.

REFERENCES
[1] OpenMP Architecture Review Board, Fortran 2.0 and C/C++ 1.0 Specifications. At www.openmp.org.
[2] B. Chapman, P. Mehrotra, and H. Zima, Enhancing OpenMP with Features for Locality Control. Proc. ECMWF Workshop "Towards Teracomputing - The Use of Parallel Processors in Meteorology", Reading, England, November 1998.
[3] J. Bircsak, P. Craig, R. Crowell, J. Harris, C. A. Nelson, C. D. Offner, Extending OpenMP For NUMA Machines: The Language. WOMPAT 2000, Workshop on OpenMP Applications and Tools, San Diego, July 2000.
[4] D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta and E. Ayguade, Is Data Distribution Necessary in OpenMP? Proc. of the IEEE/ACM Supercomputing 2000: High Performance Networking and Computing Conference, Dallas, Texas, November 2000.
[5] L. A. Smith and P. Kent, Development and Performance of a Mixed OpenMP/MPI Quantum Monte Carlo Code. Proceedings of the First European Workshop on OpenMP, Lund, Sweden, Sept. 1999, pages 6-9.
[6] S. Satoh, K. Kusano, and M. Sato, Compiler Optimization Techniques for OpenMP Programs. Second European Workshop on OpenMP (EWOMP 2000), Edinburgh, September 14-15, 2000.
[7] S. Vajracharya, S. Karmesin, P. Beckman, J. Crotinger, A. Malony, S. Shende, R. Oldehoeft, and S. Smith, Exploiting Temporal Locality and Parallelism through Vertical Execution. ICS '99.
[8] S. Vajracharya, P. Beckman, S. Karmesin, K. Keahey, R. Oldehoeft, and C. Rasmussen, Programming Model for Clusters of SMPs. PDPTA '99.
[9] P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3), pages 350-360, July 1991.
[10] V. Balasundaram, K. Kennedy, A Technique for Summarizing Data Access and its Use in Parallelism Enhancing Transformations. Proceedings of the ACM SIGPLAN '89 Conference on Programming Language Design and Implementation, June 1989.
[11] R. Triolet, F. Irigoin, P. Feautrier, Direct Parallelization of CALL Statements. Proceedings of the ACM SIGPLAN '86 Symposium on Compiler Construction, Palo Alto, CA, July 1986, pages 176-185.
[12] Béatrice Creusillet. IN and OUT array region analyses. In Workshop on Compilers for Parallel Computers, June 1995.
[13] B. Creusillet and F. Irigoin. Interprocedural array region analyses. In Lecture Notes in Computer Science - Languages and Compilers for Parallel Computing, pages 46-60. Springer-Verlag, August 1995.
[14] Michael Hind, Michael Burke, Paul Carini, and Sam Midkiff. An empirical study of precise interprocedural array analysis. Scientific Programming, 3(3), pages 255-271, 1994.
[15] Zhiyuan Li, Junjie Gu, and Gyungho Lee. Symbolic array dataflow analysis for array privatization and program parallelization. In Supercomputing '95, 1995.
[16] Tien-Hsiung Weng, Barbara M. Chapman. Implementing OpenMP Using Dataflow Execution Model for Data Locality and Efficient Parallel Execution. In Proceedings of the 7th Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS-7), IPDPS, April 2002, Ft. Lauderdale.
[17] P. Tang. Exact Side Effects for Interprocedural Dependence Analysis. Communications of the ACM, 35(8), pages 102-114, August 1992.
[18] William Pugh. The Omega test: a fast and practical integer programming algorithm for dependence analysis. Communications of the ACM, 35, pages 102-114, August 1992.
[19] Zhenying Liu, Barbara Chapman, Tien-hsiung Weng, Oscar Hernandez. Improving the Performance of OpenMP by Array Privatization. WOMPAT 2002, LNCS 2716, pages 244-256.
[20] Zhenying Liu, Barbara Chapman, Yi Wen, Lei Huang, Tien-hsiung Weng, Oscar Hernandez. Analyses for the Translation of OpenMP Codes into SPMD Style with Array Privatization. WOMPAT 2003, LNCS 2716, pages 26-41.
[21] T. Brandes and F. Desprez. Implementing pipelined computation and communication in an HPF compiler. In L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Euro-Par '96 Parallel Processing, number 1123 in Lecture Notes in Computer Science, pages 459-462, Lyon, France, August 1996. Springer.
[22] F. Desprez. A Library for Coarse Grain Macro-Pipelining in Distributed Memory Architectures. In Proceedings of the IFIP 10.3 Conference on Programming Environments for Massively Parallel Distributed Systems, pages 365-371, Monte Verita, Ascona, CH, April 1994.
[23] S. Hiranandani, K. Kennedy, and C.-W. Tseng. Evaluating compiler optimizations for Fortran D. Journal of Parallel and Distributed Computing, 21, pages 27-45, 1994.
[24] Gonzalez, M., Ayguade, E., Martorell, X. and Labarta, J.: Complex Pipelined Executions in OpenMP Parallel Applications. International Conference on Parallel Processing (ICPP 2001), September 2001.
[25] Mohammad R. Haghighat and Constantine D. Polychronopoulos. Symbolic analysis for parallelizing compilers. ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 18, number 4, pages 477-518, 1996.
[26] Thomas Fahringer and Bernhard Scholz. Symbolic evaluation for parallelizing compilers. In Proceedings of the 11th International Conference on Supercomputing, pages 261-268, Vienna, Austria, 1997.
[27] W. Blume and R. Eigenmann. The range test: A dependence test for symbolic, non-linear expressions. In Proceedings of Supercomputing '94. IEEE Press, November 1994.
[28] D. S. Nikolopoulos, E. Artiaga, E. Ayguadé and J. Labarta. Exploiting memory affinity in OpenMP through schedule reuse. ACM SIGARCH Computer Architecture News, vol. 29(5), pages 49-55, 2001.
[29] B. Chamberlain, C. Lewis and L. Snyder. Array Language Support for Wavefront and Pipelined Computations. In Workshop on Languages and Compilers for Parallel Computing, August 1999.
[30] T. Yang, A. Gerasoulis, PYRROS: Static task scheduling and code generation for message passing multiprocessors. In International Conference on Supercomputing, 1992.
[31] Liang, T. Y., Shieh, C. K., and Zhu, W. P., Task Mapping on Distributed Shared Memory Systems Using Hopfield Neural Network. In Communication Networks and Distributed Systems Modeling and Simulation Conference, Western Multi-Conference '97, January 1997, pages 37-43.
[32] Marco Vanneschi. Heterogeneous HPC Environments. Euro-Par '98, pages 21-34.
[33] Yang, T., Gerasoulis, A. List Scheduling with and without Communication. Parallel Computing Journal (19), pages 1321-1344, 1993.
[34] Weng, T.-H., Chapman, B., Wen, Y.: Practical Call Graph and Side Effect Analysis in One Pass. Technical Report, University of Houston, under preparation.
[35] The Open64 compiler. http://open64.sourceforge.net/
[36] Barbara Chapman, Oscar Hernandez, Lei Huang, Tien-hsiung Weng, Zhenying Liu, Laksono Adhianto, Yi Wen. Dragon: An Open64-Based Interactive Program Analysis Tool for Large Applications. To appear in the Proceedings of the 4th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT '03).
Hierarchical Interconnection Networks Based on (3, 3)-Graphs for Massively Parallel Processors
Gene Eu Jan, Yuan-Shin Hwang, Chung-Chih Luo, Deron Liang
Department of Computer Science
National Taiwan Ocean University
Keelung 20224, Taiwan
Abstract
This paper proposes several novel hierarchical interconnection networks based on the (3, 3)-graphs, namely folded (3, 3)-networks, root-folded (3, 3)-networks, recursively expanded (3, 3)-networks, and flooded (3, 3)-networks. Just like the hypercubes, CCC, Petersen networks, and Heawood networks, these hierarchical networks have the following nice properties: regular topology, high scalability, and small diameters. Due to these important properties, these hierarchical networks seem to have the potential to serve as alternatives for the future interconnection structures of multicomputer systems, especially massively parallel processors (MPPs). Furthermore, this paper will present routing and broadcasting algorithms for the proposed networks to demonstrate that these algorithms are as elegant as the algorithms for hypercubes, CCC, and Petersen- or Heawood-based networks.
Index Terms: Broadcasting Algorithm, Interconnection Networks, Routing Algorithm, (3, 3)-Graph, Folded (3, 3)-Networks, Root-Folded (3, 3)-Networks, Recursively Expanded (3, 3)-Networks, Flooded (3, 3)-Networks
Corresponding Author: Gene Eu Jan ([email protected])
1 Introduction
High-performance computing is centered around the concept of parallel processing. One major way to achieve this goal is by integrating multiple processors through an interconnection network, especially in massively parallel processors (MPPs), which can host hundreds or even thousands of processors. The performance of the entire system is then determined not only by the processors but also by the underlying interconnection network. Hence, high-performance interconnection networks are essential for MPPs.
Various high-performance interconnection networks have been extensively studied in the literature, such as meshes, hypercubes, twisted hypercubes [1, 6, 7], cube-connected cycles (CCC) [16], recursive networks [3, 4], and pyramids [18]. Among these networks, the hypercube family has been popular because it has several elegant properties: symmetry, regularity, high fault-tolerance, logarithmic degree and diameter, self-routing, and simple broadcasting schemes [8, 10]. Nevertheless, new networks are being proposed and analyzed with regard to their applicability and enhanced topological or performance properties. The most popular ones are those based on the Petersen or Heawood graphs and their derivatives [12]. These include the folded Petersen cube networks [14], root-folded Petersen networks, recursively expanded Petersen networks [11], hierarchical Heawood networks [9], and the hyper Petersen networks [5]. The major features of Petersen networks are: regular topology, high scalability, small diameter, and lower network cost than hypercubes.
This paper presents several new hierarchical interconnection networks based on the (3, 3)-graphs [15]: folded (3, 3)-networks, root-folded (3, 3)-networks, recursively expanded (3, 3)-networks, and flooded (3, 3)-networks. They all have basically the same features as the Petersen or Heawood networks, and hence are perfect candidates to serve as the interconnection structures of massively parallel processors.
In general, the following major criteria are commonly used to evaluate an interconnection network: diameter, degree, connectivity, and cost [14]. The diameter of a network is the maximum distance among all node pairs. The degree of a node is the number of links connected to it. The node connectivity (edge connectivity) of a graph is the minimum number of nodes (edges) whose removal results in a disconnected network. The product of degree and diameter is usually called the cost of the network. Consequently, any underlying interconnection network of MPPs must have the properties of small diameter, reasonable degree, and low cost. This paper will evaluate these hierarchical networks based on these properties.
The rest of the paper is organized as follows. Section 2 describes the (3, 3)-network and its addressing scheme. Section 3 extends the (3, 3)-networks to several n-dimensional hierarchical networks and presents the routing and broadcasting algorithms for these hierarchical networks. Section 4 evaluates these hierarchical networks by comparing their topological properties. Section 5 concludes this paper.
Figure 1: The (3, 3)-Graph on 20 Vertices
2 The (3, 3)-Graphs
This section recapitulates the definition and important properties of the (3, 3)-graphs [15]. Since the interconnection network of a parallel computer system based on the (3, 3)-graph is called a (3, 3)-network, it shares the same properties as the (3, 3)-graph. In addition, this section presents the routing and broadcasting algorithms for the (3, 3)-networks. Please note that the (3, 3)-networks defined in this section will be called basic (3, 3)-networks in order to differentiate them from the hierarchical networks proposed in Section 3.
2.1 Definition and properties
A graph with maximum degree D and diameter k is called a (D, k)-graph. The maximum number of nodes of a (D, k)-graph is bounded by the Moore bound [2]

    1 + D + D(D - 1) + ... + D(D - 1)^(k-1)

and by the fact that a D-regular graph with D odd must have an even number of vertices. Therefore, a (3, 3)-graph can have at most 20 vertices (the Moore bound is 1 + 3 + 6 + 12 = 22, which cannot be attained, and the next even number of vertices is 20). An example of a (3, 3)-graph on 20 vertices is depicted in Figure 1. The hierarchical networks proposed in this paper are based on this form of 20-node (3, 3)-graph. The node addressing scheme of the 20-node (3, 3)-graph is defined as follows.
Definition 1 (20-node (3, 3)-Graphs)
A 20-node (3, 3)-graph is a maximum (3, 3)-graph with a degree of 3 and a diameter of 3. It has a maximum number of vertices (i.e. 20) and can be divided into 5 subgraphs, each of which consists of 4 vertices. Consequently, every vertex can be denoted as a tuple (i, j), where i (0 <= i <= 4) represents the subgraph number and j (0 <= j <= 3) indicates the node number within the subgraph. Formally, a 20-node (3, 3)-graph is a graph G = (V, E), where V = {(i, j) | 0 <= i <= 4, 0 <= j <= 3} is the set of 20 vertices and E, a subset of V x V, is the set of edges.
Following are the properties of the 20-node (3, 3)-graphs:

1. Node 0 of every subgraph i connects to each node within the subgraph. Specifically, (i, 0) is adjacent to (i, j) for j = 1, 2, 3 and 0 <= i <= 4.

2. In addition to the link to node 0, the other two links of node 1 of every subgraph i connect to node 2 of subgraphs (i +- 1) mod 5. Specifically, (i, 1) is adjacent to ((i +- 1) mod 5, 2) for 0 <= i <= 4.

3. Similarly, node 2 of every subgraph i has links to node 0 of i and node 1 of the neighboring subgraphs of i. That is, (i, 2) is adjacent to ((i +- 1) mod 5, 1) for 0 <= i <= 4.

4. Node 3 of every subgraph i is connected to node 3 of subgraphs (i +- 2) mod 5, i.e. (i, 3) is adjacent to ((i +- 2) mod 5, 3) for 0 <= i <= 4.
2.2 Basic routing and broadcasting algorithms
Due to the symmetric topology of the (3, 3)-graphs, the routing and broadcasting algorithms for the basic (3, 3)-network can be easily developed. In order to give a more concise presentation of the routing and broadcasting algorithms, the following neighboring function is defined:

Algorithm Neighbors((i, j))
/* Find the neighbors of the node (i, j) within the graph */
begin
  /* Find the local neighbors within the same subgraph */
  if j = 0 then N_local = {(i, 1), (i, 2), (i, 3)}
  else N_local = {(i, 0)}
  /* Find the neighbors outside the subgraph */
  if j = 0 then N_remote = {}
  if j = 1 then N_remote = {((i - 1) mod 5, 2), ((i + 1) mod 5, 2)}
  if j = 2 then N_remote = {((i - 1) mod 5, 1), ((i + 1) mod 5, 1)}
  if j = 3 then N_remote = {((i - 2) mod 5, 3), ((i + 2) mod 5, 3)}
  /* Return all the neighbors of node (i, j) */
  return N_remote U N_local
end Algorithm Neighbors
Based on this function, the routing algorithm for the basic (3, 3)-network can be presented as follows.

Algorithm Basic-Routing(M, u, v)
/* Route the message M from node u to node v */
begin
  while u != v do
    if v is in Neighbors(u) then
      forward M to v; set u = v
    else if there exists w in Neighbors(u) and Neighbors(v) then
      forward M to w; set u = w
    else begin
      for each element w in Neighbors(u)
        if the intersection of Neighbors(w) and Neighbors(v) is nonempty then
          forward M to w; set u = w
      end for
    end if
  end do
end Algorithm Basic-Routing
Since the longest distance between any two nodes of the basic (3, 3)-network is 3, the algorithm Basic-Routing takes at most 3 steps to route a message M from the source node u to the destination node v.

Similarly, the broadcasting algorithm for the basic (3, 3)-network can be easily designed based on the topological properties of the (3, 3)-graph. Broadcasting a message on a basic (3, 3)-network is basically equivalent to sending a message from the root node (i.e. the source node of the message) of a minimum spanning tree of the basic (3, 3)-network to all other nodes. Following is the broadcasting algorithm for the basic (3, 3)-network.
Algorithm Basic-Broadcasting(M, s)
/* Broadcast the message M from the source node s */
begin
  1: s copies M to all of its neighboring nodes w in Neighbors(s)
  2: All the nodes w in Neighbors(s) copy M to the neighboring nodes x in Neighbors(w) - Neighbors(s) - {s}
  3: All the nodes x reached in step 2 copy M to the remaining neighboring nodes in Neighbors(x) - Neighbors(w) - Neighbors(s)
end Algorithm Basic-Broadcasting
Figure 2: Broadcasting from Node (0, 0) (the spanning tree shown level by level)
Level 0 (root): (0,0)
Level 1: (0,1), (0,2), (0,3)
Level 2: (1,2), (4,2), (1,1), (4,1), (2,3), (3,3)
Level 3: (1,0), (2,1), (4,0), (3,1), (2,2), (3,2), (2,0), (4,3), (3,0), (1,3)
Figure 2 shows an example of the process of broadcasting a message from node (0, 0). It is easy to show that both the Basic-Routing and Basic-Broadcasting algorithms have a constant time complexity O(1).
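The level-by-level spread of Figure 2 can be reproduced with a breadth-first sweep in which every informed node forwards the message to all of its still-uninformed neighbors, which is how Basic-Broadcasting proceeds. A small Python check (our own sketch, reusing the adjacency of Section 2.1) confirms the level sizes 3, 6, and 10 seen in the figure:

```python
def neighbors(i, j):
    """Adjacency of the 20-node (3, 3)-graph (Section 2.1)."""
    if j == 0:
        return [(i, 1), (i, 2), (i, 3)]
    if j == 1:
        return [(i, 0), ((i - 1) % 5, 2), ((i + 1) % 5, 2)]
    if j == 2:
        return [(i, 0), ((i - 1) % 5, 1), ((i + 1) % 5, 1)]
    return [(i, 0), ((i - 2) % 5, 3), ((i + 2) % 5, 3)]

def broadcast_levels(src):
    """Nodes newly informed at each step of the flooding broadcast."""
    informed, frontier, levels = {src}, [src], []
    while frontier:
        nxt = []
        for u in frontier:
            for v in neighbors(*u):
                if v not in informed:
                    informed.add(v)
                    nxt.append(v)
        if nxt:
            levels.append(sorted(nxt))
        frontier = nxt
    return levels

levels = broadcast_levels((0, 0))
assert [len(level) for level in levels] == [3, 6, 10]   # 19 other nodes in 3 steps
assert levels[0] == [(0, 1), (0, 2), (0, 3)]            # step 1, as in Figure 2
```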
3 Hierarchical Networks Based on (3, 3)-Graphs
This section presents several hierarchical extensions of the basic (3, 3)-networks: folded (3, 3)-networks, root-folded (3, 3)-networks, recursively expanded (3, 3)-networks, and flooded (3, 3)-networks.
3.1 Folded (3, 3)-Networks
To generalize the (3, 3)-network into n dimensions, several schemes are possible. Among these, the following one, called the folded (3, 3)-network, is the most prominent because it has the node-symmetric and edge-symmetric properties. It is defined based on the same concept used in the definition of the folded Petersen network [13, 14]. This section defines the folded (3, 3)-networks and presents the routing and broadcasting algorithms.
3.1.1 Definition and Properties
The formal definition of the folded (3, 3)-network is given as follows.

Definition 2 (Folded (3, 3)-Network)
An n-dimensional folded (3, 3)-network FG(n) is defined as FG(n) = (V_n, E_n), where V_n = {(x_1, x_2, ..., x_n) | x_k in V} and E_n contains an edge between (x_1, ..., x_n) and (y_1, ..., y_n) if and only if the two addresses differ in exactly one dimension k and (x_k, y_k) is an edge of the 20-node (3, 3)-graph G = (V, E). In other words, FG(n) is the n-fold graph product of the basic (3, 3)-graph.
Figure 3: An Example of a Two-Dimensional Folded (3, 3)-Network.
A two-dimensional folded (3, 3)-network FG(2) is shown in Figure 3. There are 20 edges between any two neighboring subnetworks. However, the figure only shows the detailed connections between one pair of neighboring subnetworks FG_i(1) and FG_j(1), where FG_i(n-1) denotes the i-th subnetwork of an n-dimensional folded (3, 3)-network FG(n). The detailed connections between the other subnetworks are left out for brevity.
3.1.2 Routing and Broadcasting Algorithms
The routing and broadcasting algorithms for the basic (3, 3)-networks shown in the previous section can be extended to route and broadcast messages on the folded (3, 3)-networks.
Algorithm FG(n)-Routing(M, u, v)
/* The addresses of the source and destination nodes u and v are represented as (u_1, u_2, ..., u_n) and (v_1, v_2, ..., v_n), respectively. */
begin
  for k = 1 to n step 1 do
    FG(n)-Basic-Routing(M, u, v, k)
  endfor
end Algorithm FG(n)-Routing

The FG(n)-Basic-Routing(M, u, v, k) algorithm is modified from the Basic-Routing(M, u, v) algorithm of the previous section and is shown as follows.
Algorithm FG(n)-Basic-Routing(M, u, v, k)
/* Route M within dimension k until the k-th address digits of u and v agree */
begin
  while u_k != v_k do
    if v_k is in Neighbors(u_k) then
      forward M to (u_1, ..., v_k, ..., u_n); set u_k = v_k
    else if there exists w in Neighbors(u_k) and Neighbors(v_k) then
      forward M to (u_1, ..., w, ..., u_n); set u_k = w
    else begin
      for each element w in Neighbors(u_k)
        if the intersection of Neighbors(w) and Neighbors(v_k) is nonempty then
          forward M to (u_1, ..., w, ..., u_n); set u_k = w
      end for
    end if
  end do
end Algorithm FG(n)-Basic-Routing
It is easy to see that the above algorithm has a time complexity of O(n), where n is the dimension of the network.
Similarly, the broadcasting algorithm for the folded (3, 3)-networks can be extended from the Basic-Broadcasting algorithm for the basic (3, 3)-networks and is shown as follows.
Algorithm FG(n)-Broadcasting(M, s)
begin
  for k = 1 to n step 1 do
    Basic-Broadcasting(M, s) along dimension k
  endfor
end Algorithm FG(n)-Broadcasting
Since the Basic-Broadcasting(M, s) algorithm requires constant time to execute, the time complexity of the above algorithm is again O(n).
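Since the folded network is, in effect, the n-fold graph product of the basic network (each address digit can be changed independently along its own dimension), the dimension-by-dimension routing above never needs more than 3 hops per dimension. The Python sketch below (our own construction, not code from the paper) builds the two-dimensional folded network and confirms 400 nodes, degree 6, and diameter 6 = 3n:

```python
from collections import deque
from itertools import product

def neighbors(i, j):
    """Adjacency of the 20-node (3, 3)-graph (Section 2.1)."""
    if j == 0:
        return [(i, 1), (i, 2), (i, 3)]
    if j == 1:
        return [(i, 0), ((i - 1) % 5, 2), ((i + 1) % 5, 2)]
    if j == 2:
        return [(i, 0), ((i - 1) % 5, 1), ((i + 1) % 5, 1)]
    return [(i, 0), ((i - 2) % 5, 3), ((i + 2) % 5, 3)]

def folded_neighbors(addr):
    """Neighbors in the folded (3, 3)-network: change exactly one digit of the
    address to a neighboring node of the basic graph."""
    out = []
    for d, x in enumerate(addr):
        for y in neighbors(*x):
            out.append(addr[:d] + (y,) + addr[d + 1:])
    return out

base = list(product(range(5), range(4)))
nodes = [(a, b) for a in base for b in base]          # FG(2): 20 x 20 addresses

def eccentricity(src):
    depth = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in folded_neighbors(u):
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    assert len(depth) == 400                          # connected
    return max(depth.values())

assert len(nodes) == 400                              # 20^2 nodes
assert all(len(folded_neighbors(u)) == 6 for u in nodes)  # degree 3n = 6
assert max(eccentricity(u) for u in nodes) == 6       # diameter 3n = 6
```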
3.2 Root-Folded (3, 3)-Networks
Since every subnetwork, say FG_i(n-1), of an n-dimensional folded (3, 3)-network FG(n) is connected to each of its neighboring subnetworks, say FG_j(n-1), with 20 edges, the cost of the connections will become high as n grows larger. This section presents a new class of hierarchical (3, 3)-networks, called root-folded (3, 3)-networks, as a remedy. The main difference is that any two neighboring subnetworks are now connected by a single edge.
Figure 4: A Two-Dimensional Root-Folded (3, 3)-Network
3.2.1 Definition and Properties
Definition 3 (Root-Folded (3, 3)-Networks)
An n-dimensional root-folded (3, 3)-network RFG(n) is defined as RFG(n) = (V_n, E_n), where V_n = {(x_1, x_2, ..., x_n) | x_k in V} is the same node set as in the folded network, and any two neighboring subnetworks RFG_i(n-1) and RFG_j(n-1) are connected by a single edge between their root nodes.
3.2.2 Routing and Broadcasting Algorithms
The routing and broadcasting algorithms are listed as follows:

Algorithm RFG(n)-Routing(M, u, v)
begin
  while u != v do
    if u = v then break /* u and v are the same node */
    else begin
      for k = m downto 1 do Basic-Routing(M, u, r_k) /* ascend through the root nodes */
      for k = 1 to m do Basic-Routing(M, r_k, v) /* descend toward v */
    end
  endwhile
end Algorithm RFG(n)-Routing

where m is the highest dimension in which the addresses of u and v differ and r_k denotes the root node at level k. The time complexity of the above routing algorithm for n-dimensional root-folded (3, 3)-networks is O(n).

Algorithm RFG(n)-Broadcasting(M, s)
begin
  for k = 1 to n do Basic-Routing(M, s, r_k)
  for k = 1 to n do Basic-Broadcasting(M, r_k)
end Algorithm RFG(n)-Broadcasting

Since the Basic-Broadcasting algorithm takes a constant time to execute, the time complexity of the above algorithm is again O(n).
3.3 Recursively Expanded (3, 3)-Networks
The recursive expansion (RE) method has been applied to the Petersen networks [17]. The same method can be applied to the (3, 3)-networks as well, and the new hierarchical networks will be called recursively expanded (3, 3)-networks (RE (3, 3)-networks). Each n-dimensional RE (3, 3)-network can have up to 20^n nodes. Furthermore, the maximum degree of each node is 6. Consequently, RE (3, 3)-networks can avoid the bottlenecks caused by the root nodes of root-folded (3, 3)-networks.
3.3.1 Definition and Properties
Let the basic (3, 3)-network be RE(1). An n-dimensional RE (3, 3)-network RE(n) can be defined by the following recursive expansion method:

Definition 4 (Recursively Expanded (3, 3)-Networks)
An n-dimensional network RE(n) is formed by connecting the nodes with the address (0, 0) of the 20 subnetworks RE(n-1) into a (3, 3)-network, where n >= 2.

Consequently, every connecting node (the node with the address (0, 0) of its subnetwork) of an n-dimensional RE (3, 3)-network has 6 edges, 3 of which are edges of the newly formed (3, 3)-network at the top level and the other 3 of which are connections within its own subnetwork. On the other hand, all the nodes that never serve as connecting nodes only have 3 edges. Figure 5 shows a 2-dimensional RE (3, 3)-network.
Figure 5: A Two-Dimensional RE (3, 3)-Network

3.3.2 Routing and Broadcasting Algorithms
The routing algorithm for RE (3, 3)-networks can be adapted from the routing algorithm of the basic (3, 3)-networks.

Algorithm RE(n)-Routing(M, u, v)
/* The addresses of the source and destination nodes u and v are represented as (u_1, ..., u_n) and (v_1, ..., v_n), respectively. */
begin
  for k = n downto 1 do
    RE(n)-Basic-Routing(M, u, v, k)
  endfor
end Algorithm RE(n)-Routing

Algorithm RE(n)-Basic-Routing(M, u, v, k)
begin
  while u_k != v_k do
    Basic-Routing(M, u, g_k) [ g_k: the node at the lower level with links to the nodes of the level-k (3, 3)-network ]
    if v_k is in Neighbors(u_k) then forward M to v_k; set u_k = v_k
    else if there exists w in Neighbors(u_k) and Neighbors(v_k) then
      forward M to w; set u_k = w
    else begin
      for each element w in Neighbors(u_k)
        if the intersection of Neighbors(w) and Neighbors(v_k) is nonempty then
          forward M to w; set u_k = w
      end for
    end if
  end do
end Algorithm RE(n)-Basic-Routing
The time complexity of the above routing algorithm for n-dimensional RE (3, 3)-networks is O(n).
Similarly, the broadcasting algorithm for an n-dimensional RE (3, 3)-network can be obtained by extending the broadcasting algorithm for the basic (3, 3)-networks.

Algorithm RE(n)-Broadcasting(M, s)
begin
  RE(n)-Basic-Broadcasting-All(M, s)
  parallel do
  begin
    RE(n-1)-Broadcasting(M, the first neighbor of s)
    RE(n-1)-Broadcasting(M, the second neighbor of s)
    ...
    RE(n-1)-Broadcasting(M, the last neighbor of s)
  end parallel do
end Algorithm RE(n)-Broadcasting

Algorithm RE(n)-Basic-Broadcasting-All(M, s)
begin
  if n = 0 then stop
  else begin
    for each neighboring node w of s whose flag is not equal to the address of s do
    begin
      M is forwarded from s to w
      The flag of w is set to the address of s
    end for
  end
end Algorithm RE(n)-Basic-Broadcasting-All

The time complexity of the above algorithm is O(n).
Figure 6: A Three-Dimensional Flooded (3, 3)-Network FL(3)

3.4 Flooded (3, 3)-Networks
A new form of hierarchical (3, 3)-networks, called flooded (3, 3)-networks, will now be presented. It is recursively expanded similarly to the way RE (3, 3)-networks are expanded from basic (3, 3)-networks.
3.4.1 Definition and Properties
Definition 5 (Flooded (3, 3)-Networks)
Let FL(1) be the basic (3, 3)-network. Each node of FL(1) is connected with 19 new nodes to form a basic (3, 3)-network, and the resulting network will be an FL(2) network. In turn, each of the new nodes of FL(2) will be connected to 19 new nodes to form a basic (3, 3)-network, and the resulting whole network will be an FL(3) network, and so forth.

Figure 6 depicts a 3-dimensional flooded (3, 3)-network FL(3). When n = 1, FL(1) is exactly a 1-dimensional RE (3, 3)-network RE(1). However, flooded (3, 3)-networks can be expanded to infinite dimensions.

An n-dimensional flooded (3, 3)-network FL(n) can accommodate 20 + 20*19 + ... + 20*19^(n-1) = 20(19^n - 1)/18 nodes in total. Every leaf node of FL(n) (i.e. a node at dimension n) is connected to 3 neighboring nodes, while the internal nodes have 6 neighbors each. Therefore, the average degree of an n-dimensional flooded (3, 3)-network approaches 60/19, about 3.15, as n grows.
3.4.2 Routing and Broadcasting Algorithms
Algorithm FL(n)-Routing(M, u, v)
/* The addresses of the source and destination nodes u and v are represented as (u_1, ..., u_n) and (v_1, ..., v_n), respectively. */
begin
  for k = n downto m do FL(n)-Basic-Routing(M, u, v, k)
  where m is the lowest level shared by u and v
  for k = m to n do FL(n)-Basic-Routing(M, u, v, k)
end Algorithm FL(n)-Routing

Algorithm FL(n)-Basic-Routing(M, u, v, k)
begin
  while u_k != v_k do
    Basic-Routing(M, u, p_k) [ p_k: the port node from level k to the next level ]
    if v_k is in Neighbors(u_k) then forward M to v_k; set u_k = v_k
    else if there exists w in Neighbors(u_k) and Neighbors(v_k) then
      forward M to w; set u_k = w
    else begin
      for each element w in Neighbors(u_k)
        if the intersection of Neighbors(w) and Neighbors(v_k) is nonempty then
          forward M to w; set u_k = w
      end for
    end if
  end do
end Algorithm FL(n)-Basic-Routing

Algorithm FL(n)-Broadcasting(M, s)
begin
  for k = n downto 1 do Basic-Routing(M, s, the port node at level k)
  for k = 1 to n do Basic-Broadcasting(M) within each level-k subnetwork
end Algorithm FL(n)-Broadcasting
Both FL(n)-Routing and FL(n)-Broadcasting have a time complexity of O(n).
4 Evaluation
This paper uses the following topological properties to compare the hierarchical interconnection networks presented in the previous section with some popular hierarchical networks: diameter, degree, connectivity, and cost. Figure 7 lists the topological properties of various hierarchical networks. The maximum numbers of links on nodes of the Petersen-based, Heawood-based, and (3, 3)-based hierarchical networks that are expanded by the same method are the same, but the (3, 3)-based hierarchical networks can accommodate more nodes, and hence more processors, for multiprocessor computers. The costs of the hypercube, folded Petersen, and folded (3, 3)-networks grow quadratically with the dimension, which makes them unsuitable to serve as the underlying interconnection networks of large multiprocessor systems, while the remaining hierarchical networks in Figure 7 seem to be good candidates. In addition, the flooded (3, 3)-network is developed to be suitable for integrated-circuit implementation because of its superior topological properties over the other networks: low cost, short diameter, regular topology, high scalability, and a small number of crossing edges. It is worth noting that the topological structure of flooded (3, 3)-networks can be applied to the hierarchical Petersen networks to obtain lower cost, shorter diameter, and an even smaller number of crossing edges. Without loss of generality, it is assumed that the numbers of nodes in these hierarchical (3, 3)-networks are powers of 20. The next larger power is considered if the network size falls between two powers of 20, without loss of these properties.
5 Conclusion
This paper presented several hierarchical interconnection networks, derived from the (3, 3)-graph, for high-performance multicomputer systems. All the (3, 3)-based hierarchical networks in the paper have the following nice properties: regular topology, high scalability, and small diameter. Furthermore, this paper also demonstrated that the routing and broadcasting algorithms for these hierarchical networks are as elegant as those for the Petersen-based networks and hypercubes.
References
[1] S. Abraham and K. Padmanabhan, “An analysis of the twisted cube topology,” International Conference on Parallel Processing, Vol. 1, pages 116-120, 1989.
[2] B. Bollobas, Extremal Graph Theory, Academic Press, London, 1978.
[3] S.K. Das, “Designing recursive networks combinatorially,” International Conference on Graph Theory, Combinatorics, and Computing, Baton Rouge, LA, Feb. 1991.
Figure 7: Topological Properties of Various Hierarchical Networks. The table compares the number of nodes, degree, diameter, and cost of the hypercube, CCC, the folded, root-folded, recursively expanded, and flooded Petersen and Heawood networks, and the folded, root-folded, recursively expanded, and flooded (3, 3)-graph networks. Among the surviving entries, the CCC has degree 3, the root-folded networks have degree 4 or 3, the recursively expanded networks have maximum degree 6, and the flooded (3, 3)-graph network has an average degree of 3.15.
[4] S.K. Das and A. Mao, “Embeddings in recursive combinatorial networks,” Workshop on Graph-Theoretic Concepts in Computer Science, Wiesbaden, Germany, June 18-20, 1992.
[5] S. K. Das, S. Ohring, and A. K. Banerjee, “Embeddings into hyper Petersen networks: Yetanother hypercube-like interconnection topology,” Journal of VLSI, special issue on intercon-nection networks, vol. 2, no. 4, pp. 335-351, 1995.
[6] A.H. Esfahanian, L.M. Ni, and B.E. Sagan, “On enhancing hypercube multiprocessors,”International Conference on Parallel Processing, pages 86-89, 1988
[7] P.A.J. Hilbers, M.R.J. Koopman, and J.L.A. van de Snepscheut, “The twisted cube,” Proc. Parallel Architectures and Languages Europe, pages 152-159, June 1987.
[8] Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability,New York: McGraw-Hill, 1993.
[9] Gene Eu Jan, Yuan-Shin Hwang, Ming-Bo Lin, and Deron Liang, “Novel hierarchical interconnection networks for high-performance multicomputer systems,” to appear in Journal of Information Science and Engineering (under minor revision).
[10] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Redwood City, California: The Benjamin/Cummings Publishing Company, Inc., 1994.
[11] Ming-Bo Lin and Gene Eu Jan, “Routing and broadcasting algorithms for the root-folded Petersen networks,” Journal of Marine Science and Technology, Vol. 6, No. 1, pages 65-70, 1998.
[12] James A. McHugh, Algorithmic Graph Theory, Englewood Cliffs, New Jersey: Prentice-Hall,1990.
[13] Sabine Ohring and Sajal K. Das, “The folded Petersen network: A new communication-efficient multiprocessor topology,” 1993 International Conference on Parallel Processing,Vol. I, pp. 311-314, Aug. 1993.
[14] Sabine Ohring and Sajal K. Das, “Folded Petersen cube networks: new competitors for thehypercubes,” IEEE Trans. on Parallel and Distributed Systems, Vol. 7, No. 2, pp. 151-168,February 1996.
[15] Robert W. Pratt, “The Complete Catalog of 3-Regular Diameter-3 Planar Graphs,” Depart-ment of Mathematics, University of North Carolina at Chapel Hill, 1996.
[16] F. Preparata and J. Vuillemin, “The cube-connected cycles: a versatile network for parallelcomputation,” Communications of the ACM, Vol.24, No.5, pages 300-309, 1981.
[17] H. Shen, “A high performance interconnection network for multiprocessor,” Parallel Com-puting, pages 993-1001, Sep. 1992.
[18] L. Uhr, Multicomputer architecture for artificial intelligence, Wiley Interscience, New York,1987.
Session 6: Distributed Computing
HIGH SYSTEM AVAILABILITY USING NEIGHBOR REPLICATION ON GRID
M. MAT DERIS, A. NORAZIAH, M.Y. SAMAN, A. NORAIDA, Y.W. YUAN*

University College Terengganu, Faculty of Science and Technology, 21030 Mengabang Telipot, Kuala Terengganu, Malaysia.

E-mail: [email protected] [email protected] [email protected]

*Department of Computer Science and Technology, Zhuzhou Institute of Technology, Hunan 412008, China. E-mail: [email protected]
Abstract
Data replication can be used to improve the system availability of distributed systems. In such a system, a mechanism is required to maintain the consistency of the replicated data. The grid structure technique based on quorums is one of the solutions that performs this task while providing high system availability. However, previous studies have shown that it still requires a relatively large number of copies to construct a quorum, so it is not suitable for large systems. In this paper, we propose a technique called neighbor replication on grid (NRG), in which only the neighbors of a site hold replicated data. In comparison to the grid structure technique, NRG requires a lower communication cost for an operation while providing higher system availability, which is preferred for large systems.
Keywords : replicated data, binary grid assignment, storage capacity, communication cost, system availability.
1. INTRODUCTION
Replication is a useful technique for distributed systems, where it increases the data
availability and accessibility to users despite site and communication failures [1,2,3]. It is
an important mechanism because it enables organizations to provide users with access to
current data where and when they need it. Availability refers to the probability that the
system is operational according to its specification at a given point in time. System
availability refers to the availability of read and/or write operations of the system.
To maintain the consistency and integrity of data in distributed database systems,
expensive synchronization mechanisms are needed. One of the simplest techniques for
managing replicated data is read-one write-all (ROWA) technique [3]. Read operations on
a data object are allowed to read any copy, and write operations are required to write all
copies of the data object. This technique has been proposed for managing data in mobile
and peer-to-peer (P2P) environments [2]. However, it provides read operations with a high

degree of availability at low cost but severely restricts the availability of write operations
since they cannot be executed at the failure of any copy. This technique results in the
imbalance of availability and also in the communication cost of read and write operations.
The read operations have a high availability and low communication cost whereas write
operations have a low availability with higher communication cost. Voting techniques [7]
became popular because they are flexible and are easily implemented. One weakness of
these techniques is that writing an object is fairly expensive: a write quorum of copies
must be larger than the majority of votes. To optimize the communication cost and the
availability of both read and write operations, the quorum technique generalizes the
ROWA technique by imposing the intersection requirement between read and write
operations [5]. Write operations can be made fault-tolerant since they do not need to
access all copies of data objects. Dynamic quorum techniques have also been proposed
to further increase availability in replicated databases [8,9]. However, these approaches
do not address the issue of low-cost read operations. Another technique, called Tree
quorum technique [10,11] uses quorums that are obtained from a logical tree structure
imposed on data copies. However, this technique has some drawbacks. If more than a
majority of the copies in any level of the tree become unavailable, write operations cannot

be executed.
Recently, several researchers have proposed logical structures on the set of copies in
the database, and used logical information to create intersecting quorums. A technique

that uses a logical structure, such as the grid structure technique, executes operations with low

communication costs while providing fault-tolerance for both read and write operations

[6]. However, this technique still requires that a relatively large number of copies be made
available to construct a quorum.
In this paper, without loss of generality, the terms node and site will be used
interchangeably. If data is replicated to all sites, the storage capacity becomes an issue,
thus an optimal number of sites to replicate the data is required while preserving acceptable

availability of the system. The focus of this paper is on modeling a technique to optimize
the communication costs and the system availability of the replicated data in the
distributed systems.
We describe the Neighbor Replication on Grid (NRG) technique, in which only the

neighbors of a site obtain a data copy. For simplicity, the neighbors are assigned vote one and
zero otherwise. This assignment provides a minimum communication cost with higher
system availability due to the minimum number of quorum size required. Moreover, it
can be viewed as an allocation of replicated copies such that a copy is allocated to a site if
and only if the vote assigned to the site is one. This approach is particularly useful for
large systems and also for P2P under the fixed network environment.
This paper is organized as follows: In Section 2, we review the tree quorum and the

grid structure techniques, which are then compared to the proposed technique. In Section

3, the model and the technique of NRG are presented. Section 4 analyzes the

performance of the proposed technique in terms of communication cost and availability,

and gives a comparison with the grid structure technique. The conclusion is presented in Section 5.
2. REVIEW OF REPLICA CONTROL TECHNIQUE
In this section, we review some of the replica control techniques namely the Tree
Quorum (TQ) technique, and Grid Structure (GS) technique, which are then compared to
the proposed NRG technique.
2.1 Tree quorum technique
Agrawal and El-Abbadi [10] proposed a logical tree structure over a network of sites
to achieve fault-tolerant distributed mutual exclusion. One advantage of this protocol is
that a read operation may access only one copy and the number of copies to be accessed
by a write operation is always less than a majority quorum, i.e., the quorum size

may increase to a maximum of n/2 + 1 sites when failures occur. Write operations in the tree
protocol must access a write quorum which is formed from the root, a majority of its
children, and a majority of their children, and so forth until the leaves of the tree are
reached. In Figure 1, an example of a ternary tree with thirteen copies is illustrated. For
read operations, a read quorum can be formed by the root or the majority of the children
of the root. If any node in the majority of the children of the root fails, it can be replaced

by the majority of its children, and so on recursively. In the best case, a

read quorum consists of only the root, {1}. If the root fails, a quorum is formed by the

majority of the copies at level 1, e.g., {2,3}, {3,4}, or {2,4}. If no majority at level 1 is

accessible and only site 4 is accessible, the majorities of their children replace nodes 2 and

3, respectively. Such quorums are {4,5,6} or {4,9,10}. If copies at level 0 and level 1 fail, a

quorum is formed by the majorities of the children of the selected majority at level 1, e.g.,

{5,7,8,10} or {9,10,11,12}. For write operations, the size of a write quorum for a given

tree is fixed, but the members can be different. Such quorums are

{1,2,3,5,6,8,10}, {1,3,4,9,10,11,13}, etc.
The desirable aspects of this protocol are that the cost of executing read operations is
comparable to the ROWA protocol while the availability of write operations is
significantly better [1]. However, this protocol has some drawbacks. As an example, if
more than a majority of the copies at any level of the tree become unavailable, write
operations cannot be executed.
Figure 1. A tree organization of 13 copies of an object: the root 1 at level 0; nodes 2, 3, and 4 at level 1; and their children 5-7, 8-10, and 11-13, respectively, at level 2.
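The recursive replacement rule described above (a read quorum is the root, or the union of read quorums of a majority of its children) can be sketched directly. The following Python fragment is our own illustration on the 13-copy ternary tree of Figure 1, not code from the paper; it reproduces the example quorums {1} and {2, 3} under increasing failures:

```python
# children of each internal node in the ternary tree of Figure 1
TREE = {1: [2, 3, 4], 2: [5, 6, 7], 3: [8, 9, 10], 4: [11, 12, 13]}

def read_quorum(node, up):
    """Form a read quorum rooted at `node` given the set `up` of live copies,
    or return None if no quorum can be formed."""
    if node in up:
        return {node}                      # a live node represents itself
    children = TREE.get(node, [])
    need = len(children) // 2 + 1          # a majority of the children
    quorum, got = set(), 0
    for child in children:
        sub = read_quorum(child, up)
        if sub is not None:
            quorum |= sub
            got += 1
            if got == need:
                return quorum
    return None                            # too many failures below this node

all_up = set(range(1, 14))
assert read_quorum(1, all_up) == {1}                  # best case: the root alone
assert read_quorum(1, all_up - {1}) == {2, 3}         # root down: level-1 majority
assert read_quorum(1, all_up - {1, 2}) == {3, 5, 6}   # node 2 replaced by {5, 6}
assert read_quorum(1, set()) is None                  # total failure: no quorum
```

The write-quorum construction (root, a majority of its children, and so on down to the leaves) follows the same recursion with the root always included.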
2.2 Grid Structure technique
Maekawa [13] has proposed a technique for a grid structure by using the notion of
finite projective planes to obtain a distributed mutual exclusion algorithm where all
quorums are of equal size. Cheung et al. [14], extended Maekawa's grid technique for
replicated data that supports read and write operations. In this technique, if there are n
sites in the system, all sites are logically organized in the form of a two-dimensional
√n x √n grid structure as shown in Figure 2.
Figure 2. A grid organization with 25 copies of a data object: copies 1-25 are arranged row by row in a 5 x 5 grid (first row 1-5, second row 6-10, and so on through 21-25).
Read operations on the data item are executed by acquiring a read quorum that consists of
a copy from each column in the grid. Write operations, on the other hand, are executed by
acquiring a write quorum that consists of all copies in one column and a copy from each
of the remaining columns. The read and write quorums of this technique are of size

O(√n); this technique is normally referred to as the sqrt(R/W) technique. In Figure 2,

copies {1,2,3,4,5} are sufficient to execute a read operation, whereas copies

{1,6,11,16,21,7,13,19,25} will be required to execute a write operation. However, it still
has a bigger number of copies for read and write quorums, thereby degrading the
communication cost and data availability. It is also vulnerable to the failure of entire
column or row in the grid [6].
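The GS quorum construction can be sketched in a few lines. The function names and the particular copies chosen (the first row as the read quorum, the first column as the fully written one) are illustrative assumptions, not prescribed by the technique:

```python
import math

def grid(n):
    """Arrange copies 1..n in a sqrt(n) x sqrt(n) grid, row by row."""
    side = math.isqrt(n)
    return [[r * side + c + 1 for c in range(side)] for r in range(side)]

def gs_read_quorum(n):
    """A read quorum: one copy from each column (here, the first row)."""
    return [col[0] for col in zip(*grid(n))]

def gs_write_quorum(n):
    """A write quorum: all copies of one column plus one copy from each
    of the remaining columns."""
    cols = list(zip(*grid(n)))
    return list(cols[0]) + [col[0] for col in cols[1:]]

print(gs_read_quorum(25))        # [1, 2, 3, 4, 5], size sqrt(n)
print(len(gs_write_quorum(25)))  # 9, i.e. 2*sqrt(n) - 1
```

Any such read quorum meets any such write quorum in the fully written column, which is the intersection the technique relies on.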
The notion of quorums has been generalized by Agrawal and Abbadi [6] in the
context of the grid structure. In general, a grid quorum has a length l and a width w, and
is represented as a pair <l,w>. A read operation is executed by accessing a read grid
quorum <l,w>, which is formed from l copies in each of w different columns. A write
operation is executed by writing a write grid quorum, which is formed from: a) l
copies in each of w columns, as well as b) any n-l+1 copies in each of n-w+1 columns.
The second component of the write quorum ensures the intersection between two
write quorums. To maintain the quorum intersection property between read and
write quorums, if the read grid quorum is of size <l,w>, then the write grid
quorum must be <n-l+1,n-w+1>.
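The intersection rests on a pigeonhole argument: the w columns of a read quorum and the n-w+1 columns of a write quorum must share a column, and within that column l and n-l+1 copies must share a member. A minimal sketch (the `grid_quorum` helper and the "opposite corner" placement are illustrative assumptions, not from the paper):

```python
def grid_quorum(rows, cols):
    """Copies at the given rows of the given columns of an n x n grid."""
    return {(r, c) for r in rows for c in cols}

# An n x n grid of copies; an <l, w> read quorum placed in the top-left
# corner, and an <n-l+1, n-w+1> write quorum placed in the bottom-right
# corner, i.e. the two quorums are as far apart as they can be.
n, l, w = 5, 2, 3
read_q  = grid_quorum(range(l), range(w))                # <l, w>
write_q = grid_quorum(range(l - 1, n), range(w - 1, n))  # <n-l+1, n-w+1>

# Even in this worst case the two quorums share exactly one copy.
print(read_q & write_q)   # {(1, 2)}
```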
3. NEIGHBOR REPLICATION ON GRID (NRG) TECHNIQUE
3.1 Model

A distributed database consists of a set of data objects stored at different sites in a
computer network. Users interact with the database by invoking transactions, which are
partially ordered sequences of atomic read and write operations. The execution of a
transaction must appear atomic: a transaction either commits or aborts [15,16].
In a replicated database, copies of a data object may be stored at several sites in the
network. Multiple copies of a data object must appear as a single logical data object to
the transactions. This is termed one-copy equivalence and is enforced by the replica
control technique. The correctness criterion for replicated databases is one-copy
serializability [17], which ensures both one-copy equivalence and the serializable
execution of transactions. To ensure one-copy serializability, a replicated data
object may be read by reading a quorum of copies, and it may be written by writing a
quorum of copies. The selection of a quorum is restricted by the quorum intersection
property to ensure one-copy equivalence: for any two operations o[x] and o'[x] on a data
object x, where at least one of them is a write, their quorums must have a non-empty
intersection. The quorum for an operation is defined as a set of copies whose number is
sufficient to execute that operation.
Briefly, a site i initiates an NRG transaction to update its data object. For all accessible
data objects, an NRG transaction attempts to acquire an NRG quorum. If an NRG transaction
obtains a write quorum with non-empty intersection, it is accepted for execution and
completion; otherwise it is rejected. For the read quorum, we assume that if two
transactions attempt to read a common data object, the read operations do not change the
values of the data object. Since read and write quorums must intersect and any two NRG
quorums must also intersect, all transaction executions are one-copy serializable.
3.2 The NRG Technique
In NRG, all sites are logically organized in the form of a two-dimensional grid
structure. For example, if an NRG consists of twenty-five sites, it will be logically
organized in the form of a 5 x 5 grid, as shown in Figure 2. Each site has a master data file.
In the remainder of this paper, we assume that replica copies are data files. A site is
either operational or failed, and the state (operational or failed) of each site is statistically
independent of the others. When a site is operational, the copy at the site is available;
otherwise it is unavailable.
Definition 3.2.1: A site X is a neighbor of site Y if X is logically located adjacent to Y.

A data file is replicated to the neighboring sites of its primary site. The number of
replicated copies, d, can be determined using Property 3.2, as described below.
Property 3.2: The number of replicated copies from each site, d ≤ 5.

Proof: Let n sites be logically organized in a two-dimensional grid
structure, labeled m(i,j), 1≤i≤√n, 1≤j≤√n. Two-way links
connect each site m(i,j) with its four neighbors, sites m(i±1,j) and m(i,j±1), as long as those
sites exist in the grid. Note that the four sites on the corners of the grid have only two adjacent
sites, and the other sites on the boundaries have only three neighbors. Thus the number of
neighbors of each site is at most 4. Since the data is replicated to the
neighbors, the number of replicated copies from each site, d, is:

d ≤ the number of neighbors + a copy at the site itself
  = 4 + 1 = 5.
For example, in Figure 2, the data from site 1 is replicated to sites 2 and 6, which are
its neighbors. Site 7 has four neighbors, namely sites 2, 6, 8, and 12. As such, site 7 has
five replicas.
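The neighbor relation of Definition 3.2.1 and the bound of Property 3.2 are easy to check programmatically. A sketch assuming row-major labelling of the sites from 1 to n (the function names are illustrative):

```python
import math

def neighbors(site, n):
    """Sites adjacent (up/down/left/right) to `site` in a sqrt(n) x sqrt(n)
    grid whose sites are labelled 1..n row by row."""
    side = math.isqrt(n)
    r, c = divmod(site - 1, side)
    adj = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [rr * side + cc + 1 for rr, cc in adj
            if 0 <= rr < side and 0 <= cc < side]

def replica_set(site, n):
    """Primary copy plus copies at the neighbors, so d <= 5 (Property 3.2)."""
    return [site] + neighbors(site, n)

print(replica_set(1, 25))   # corner site: [1, 6, 2], so d = 3
print(replica_set(7, 25))   # interior site: [7, 2, 12, 6, 8], so d = 5
```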
For simplicity, the primary site of any data file and its neighbors are each assigned
vote one, and all other sites vote zero. This vote assignment is called neighbor binary vote
assignment on grid. A neighbor binary vote assignment on grid, B, is a function such that

B(i) ∈ {0,1}, 1 ≤ i ≤ n

where B(i) is the vote assigned to site i. This assignment is treated as an allocation of
replicated copies: a vote assigned to a site results in a copy allocated at that site.
That is,

1 vote ≡ 1 copy.
Let

LB = ∑_{i=1}^{n} B(i)
where LB is the total number of votes assigned to the primary site and its neighbors; it
also equals the number of copies of the file allocated in the system. Thus, LB = d.
Let r and w denote the read quorum and the write quorum, respectively. To ensure that
read operations always get up-to-date values, r + w must be greater than the total
number of copies (votes) assigned to all sites. The following conditions are used to
ensure consistency:

1) 1 ≤ r ≤ LB, 1 ≤ w ≤ LB,
2) r + w = LB + 1.
Conditions (1) and (2) ensure that there is a nonempty intersection of copies between
every pair of read and write operations. Thus, the conditions ensure that a read operation
can access the most recently updated copy of the replicated data. Timestamps can be used
to determine which copies are most recently updated.
Let S(B) be the set of sites at which replicated copies are stored corresponding to the
assignment B. Then

S(B) = {i | B(i) = 1, 1 ≤ i ≤ n}.
Definition 3.2.2: For a quorum q, a quorum group is any subset of S(B) whose size is
greater than or equal to q. The collection of quorum groups is called the quorum set.
Let Q(B,q) be the quorum set with respect to the assignment B and quorum q; then

Q(B,q) = {G | G ⊆ S(B) and |G| ≥ q}.
For example, in Figure 2, let site 8 be the primary site of the master data file x. Its
neighbors are sites 3, 7, 9 and 13. Consider an assignment B for the data file x such that

Bx(8) = Bx(3) = Bx(7) = Bx(9) = Bx(13) = 1

and LBx = Bx(8) + Bx(3) + Bx(7) + Bx(9) + Bx(13) = 5.

Therefore, S(Bx) = {8,3,7,9,13}.
If the read quorum for data file x is r = 2 and the write quorum is w = LBx - r + 1 = 4, then the
quorum sets for read and write operations are Q(Bx,2) and Q(Bx,4), respectively, where

Q(Bx,2) = {{8,3},{8,7},{8,9},{8,13},{8,3,7},{8,3,9},{8,3,13},{8,7,9},{8,7,13},
{8,9,13},{3,7,9},{3,7,13},{3,9,13},{7,9,13},{8,3,7,9},{8,3,7,13},
{8,3,9,13},{8,7,9,13},{3,7,9,13},{8,3,7,9,13}}

and

Q(Bx,4) = {{8,3,7,9},{8,3,7,13},{8,3,9,13},{8,7,9,13},{3,7,9,13},{8,3,7,9,13}}.
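Quorum sets in the sense of Definition 3.2.2 can be enumerated mechanically. A sketch (the helper name is an assumption) that lists every subset of S(Bx) of size at least q and confirms that each read group meets each write group, which follows from r + w = LBx + 1:

```python
from itertools import combinations

def quorum_set(S, q):
    """Q(B, q): all subsets of S(B) with at least q members."""
    return [set(g) for k in range(q, len(S) + 1)
                   for g in combinations(S, k)]

S_Bx = [8, 3, 7, 9, 13]              # primary site 8 and its neighbors
read_groups  = quorum_set(S_Bx, 2)   # Q(Bx, 2)
write_groups = quorum_set(S_Bx, 4)   # Q(Bx, 4)

# Every read group meets every write group: |G| + |H| >= 6 > |S(Bx)| = 5.
assert all(g & h for g in read_groups for h in write_groups)
print(len(read_groups), len(write_groups))   # 26 6
```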
3.3 The Correctness of NRG
In this section, we will show that the NRG technique is one-copy serializable. We
start by defining sets of groups called coterie [13] and to avoid confusion we refer the
sets of copies as groups. Thus, sets of groups are sets of sets of copies.
Definition 3.3.1. Coterie. Let U be a set of groups that compose the system. A set of
groups T is a coterie under U iff
i) G ∈ T implies that G ≠ ∅ and G ⊆ U.
ii) If G, H ∈ T then G ∩ H ≠ ∅ ( intersection property)
iii) There are no G, H ∈ T such that G ⊂ H (minimality).
By the definition of a coterie and from Definition 3.2.2, Q(B,w) is a coterie, because
it satisfies all of the coterie properties.
Since read operations do not change the value of the accessed data object, a read
quorum does not need to satisfy the intersection property. To ensure that a read operation
can access the most recently updated copy of the replicated data, the two conditions in
sub-section 3.2 must be satisfied. A write quorum, on the other hand, needs to satisfy the
read-write and write-write intersection properties. The correctness criterion for replicated
databases is one-copy serializability. The next theorem provides a mechanism to check
whether NRG is correct.
Theorem 3.3.1. The NRG technique is one-copy serializable.
Proof: The theorem holds provided that the NRG technique satisfies the quorum
intersection properties, i.e., the write-write and read-write intersections. Since Q(B,w) is a
coterie, it satisfies the write-write intersection. For the read-write
intersection, it can easily be shown that for all G ∈ Q(B,r) and all H ∈ Q(B,w), G ∩ H ≠ ∅.
4. PERFORMANCE ANALYSIS AND COMPARISON
In this section, we analyze and compare the performance of the NRG technique with
the tree quorum and grid structure techniques in terms of communication cost and system availability.
4.1 Communication Costs and Availability Analysis
The communication cost of an operation is directly proportional to the size of the
quorum required to execute it. Therefore, we represent the communication cost in terms
of the quorum size. In estimating the availability, all copies are assumed to have the same
availability p. CX,Y denotes the communication cost of technique X for operation Y,
where Y is R (read) or W (write).
4.1.1 Grid Structure (GS) technique
Let n be the number of copies, organized as a grid of dimension √n x √n.
Read operations on the replicated data are executed by acquiring a read quorum that
consists of a copy from each column in the grid. Write operations are executed by
acquiring a write quorum that consists of all copies in one column and a copy from each
of the remaining columns, as discussed in Section 2. Thus, the communication cost,
CGS,R, can be represented as:

CGS,R = √n,

and the communication cost, CGS,W, can be represented as:

CGS,W = √n + (√n - 1) = 2√n - 1.
Let AX,Y be the availability of technique X for operation Y. In the case of the quorum
technique, a read quorum can be constructed as long as a copy from each column is
available. The read availability, AGS,R, is the probability that the system is able to
perform a read operation at a given point in time. For the GS technique, AGS,R, as given in [4], is:

AGS,R = [1-(1-p)^√n]^√n   (1)

On the other hand, a write quorum can be constructed as long as all copies of one column and one
copy from each of the remaining columns are available. The write availability is the
probability that the system is able to perform a write operation at a given point in time. For the GS
technique, AGS,W, as given in [4], is:

AGS,W = [1-(1-p)^√n]^√n - [1-(1-p)^√n - p^√n]^√n.   (2)
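Equations (1) and (2) are straightforward to evaluate numerically. A sketch with assumed function names, which reproduces the GS entries of Table 2:

```python
def gs_read_avail(p, n):
    """Equation (1): every column has at least one live copy."""
    s = round(n ** 0.5)
    col_alive = 1 - (1 - p) ** s          # P(some copy in a column is up)
    return col_alive ** s

def gs_write_avail(p, n):
    """Equation (2): every column has a live copy, and at least one
    column is fully live."""
    s = round(n ** 0.5)
    col_alive = 1 - (1 - p) ** s
    col_full  = p ** s                    # P(all copies in a column are up)
    return col_alive ** s - (col_alive - col_full) ** s

print(round(gs_write_avail(0.7, 36), 3))   # 0.526, as in Table 2
```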
4.1.2 Tree quorum (TQ) technique
Let h denote the height of the tree, D the degree of the copies in the tree, and M =
⌈(D+1)/2⌉ the majority of the degree. When the root is accessible, the read
quorum size is 1. If the root fails, the majority of its children replace it, so the quorum
size increases to M. Therefore, for a tree of height h, the maximum read quorum size is M^h.
Hence, the cost of a read operation, CTQ,R, ranges from 1 to M^h [6,11], i.e., 1 ≤ CTQ,R ≤ M^h.
The cost of a write operation, CTQ,W, which requires a majority at every level of the tree, can be
represented as: CTQ,W = ∑ M^i, i = 0…h.   (3)
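A small sketch of the TQ quorum sizes (the function name is an assumption), counting a majority at each level from the root (M^0 = 1) down to the leaves; it reproduces the n = 36 entries of Table 1 for a tree of height 3 and degree 2:

```python
import math

def tq_quorum_sizes(h, D):
    """Worst-case read quorum and write quorum size for a tree of height h
    in which every copy has D children; M = ceil((D + 1) / 2)."""
    M = math.ceil((D + 1) / 2)
    max_read = M ** h                             # root failed all the way down
    write    = sum(M ** i for i in range(h + 1))  # a majority at every level
    return max_read, write

print(tq_quorum_sizes(3, 2))   # (8, 15): the n = 36 entries of Table 1
```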
4.1.3 The NRG technique
Let pi denote the availability of site i. Read operations on the replicated data are
executed by acquiring a read quorum, and write operations by acquiring a
write quorum. For simplicity, we choose the read quorum equal to the write quorum.
Thus, the communication cost for read and write operations equals ⌈LBx/2⌉, that is,
CNRG,R = CNRG,W = ⌈LBx/2⌉. For example, if the primary site has four neighbors, each of
which has vote one, then CNRG,R = CNRG,W = ⌈5/2⌉ = 3.
For any assignment B and quorum q for the data file x, define ϕ(Bx,q) to be the
probability that at least q sites in S(Bx) are available. Then

ϕ(Bx,q) = Pr{at least q sites in S(Bx) are available}
        = ∑_{G∈Q(Bx,q)} [ ∏_{j∈G} p_j ] [ ∏_{j∈S(Bx)-G} (1-p_j) ]   (4)
Thus, the availabilities of read and write operations for the data file x are ϕ(Bx,r) and
ϕ(Bx,w), respectively. Let Av(Bx,r,w) denote the system availability corresponding to the
assignment Bx, read quorum r and write quorum w. If the probabilities that an arriving
operation on data file x is a read or a write are f and (1-f), respectively, then

Av(Bx,r,w) = f ϕ(Bx,r) + (1-f) ϕ(Bx,w).

Since in this paper we consider equal read and write quorums, the system availability is

Av(Bx,r,w) = ϕ(Bx,r) = ϕ(Bx,w).   (5)
Definition 4.1.3. Let Av(x) be the availability function with respect to x. Av(x) is in
closed form if 0 ≤ x ≤ 1 implies 0 ≤ Av(x) ≤ 1.

Theorem 4.1.3. The system availability under the NRG technique is in closed form.

Proof: For the case of read availability, from equation (4) and by Definition 4.1.3, as
0 ≤ pi ≤ 1, i = 1,2,…,LBx, we have 0 ≤ ϕ(Bx,r) ≤ 1. Similarly, for the case of write availability,
0 ≤ ϕ(Bx,w) ≤ 1 as 0 ≤ pi ≤ 1.
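Equation (4) can be evaluated by direct enumeration of the quorum groups. A sketch (names are assumptions) that reproduces the NRG value of Table 2 at p = 0.7:

```python
from itertools import combinations

def phi(p, sites, q):
    """Equation (4): probability that at least q of the replica sites are up,
    assuming each site i is up independently with probability p[i]."""
    total = 0.0
    for k in range(q, len(sites) + 1):
        for group in combinations(sites, k):
            prob = 1.0
            for i in sites:
                prob *= p[i] if i in group else 1 - p[i]
            total += prob
    return total

sites = [8, 3, 7, 9, 13]
p = {i: 0.7 for i in sites}        # uniform copy availability of 70%
print(round(phi(p, sites, 3), 3))  # 0.837, the NRG column of Table 2
```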
4.2 Performance Comparisons
4.2.1 Comparison of costs
The communication cost of an operation is directly proportional to the size of the
quorum required to execute the operation. Therefore, we represent the communication
cost in terms of the quorum size. Table 1 shows the read and write costs of the NRG,
TQ and logical Grid Structure (GS) techniques for different total numbers of
copies, n = 25, 36, 49, 64, and 81. The GS comparison data are derived from Agrawal and
Abbadi [2].
Table 1. Comparison of the read and write costs between GS, TQ and NRG under different numbers of copies

Number of copies    25   36   49   64   81
GS(R)                5    6    7    8    9
GS(W)                9   11   13   15   17
TQ(R)                8    8   11   16   16
TQ(W)               13   15   19   27   31
NRG(R/W)             3    3    3    3    3
From Table 1, it is apparent that NRG has the lowest cost for both read and write
operations, even as the number of copies grows, when compared with the TQ and GS
quorums. NRG needs only 3 copies for the read and write quorums in
all instances. In contrast, for TQ with a tree of height 3 on 36 copies, the read cost is 8,
while for GS the cost is 6 with 36 copies. The write cost is 19 and 13 for TQ and GS,
respectively, with 49 copies. These costs increase gradually as the number of copies increases.
4.2.2 Comparisons of system availabilities
In this section, we compare the system availability of the TQ and GS techniques,
based on equations (1), (2) and (3), with that of our NRG technique, based on
equations (4) and (5), for the cases n = 36, 49, and 64. In estimating the availability of
operations, all copies are assumed to have the same availability.
Figure 3 and Table 2 show that the NRG technique outperforms the TQ and GS
techniques. When an individual copy has availability 70%, the write availability of NRG
is approximately 84%, whereas the write availability of TQ and GS is approximately 25%
and 56%, respectively, for n = 36. Moreover, the write availability of TQ and GS decreases
as n increases.
Table 2. Comparison of the write availability between TQ, GS and NRG for p = 0.1,…,0.9

Technique     0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
NRG(R/W)     .009   .058   .163   .317   .500   .683   .837   .942   .991
GS(W),n=36   .000   .000   .002   .019   .083   .244   .526   .838   .989
GS(W),n=64   .000   .000   .000   .005   .030   .126   .378   .770   .988
TQ(W),n=36   2E-12  4E-8   1E-5   .001   .009   .063   .248   .568   .853
TQ(W),n=64   8E-25  9E-16  1E-10  4E-7   .0001  .007   .107   .489   .847
Figure 3. Comparison of the write availability between TQ, GS and NRG
Table 3. Comparison of the system availability between TQ, GS and NRG for p = 0.7 and f = 0.1,…,0.9

Technique    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
NRG         .837   .837   .837   .837   .837   .837   .837   .837   .837
GS,n=36     .573   .620   .667   .714   .761   .808   .855   .902   .948
GS,n=64     .440   .502   .564   .627   .689   .751   .813   .875   .937
TQ,n=36     .323   .399   .474   .549   .624   .699   .774   .850   .925
TQ,n=64     .100   .286   .375   .464   .554   .643   .732   .821   .911
Figure 4. Comparison of the system availability between TQ, GS and NRG for p = 0.7
The NRG technique also outperforms the TQ and GS techniques when the
probability that an arriving operation is a read, f, is at most 0.6, as shown in Figure 4 and Table 3.
This shows that the system availability is very sensitive to the read/write probability. For
example, when f = 0.5, the system availability of TQ and GS for n = 36 is approximately
62% and 76%, respectively, whereas the system availability of NRG is approximately
84% in all instances. Moreover, the system availability of TQ and GS decreases as n
increases. This is due to the decrease in write availability as n increases.
5. CONCLUSIONS
In this paper, a new technique, called neighbor replication on grid (NRG), has been
proposed to manage data replication in distributed database systems. The NRG technique
was analyzed in terms of system availability and communication cost. The analysis
showed that NRG provides a convenient approach to achieving high availability for
update-frequent workloads by imposing a neighbor binary vote assignment on the logical
grid structure of data copies. This is due to the small quorum size required. The technique
also provides an optimal neighbor binary vote assignment that requires less computation
time. In comparison with the Tree Quorum and Grid Structure techniques, NRG requires
significantly lower communication cost per operation while providing higher system
availability, which is preferable for large systems. Thus, this technique represents an
alternative design philosophy to the quorum approach.
References
1. O. Wolfson, S. Jajodia, and Y. Huang, "An Adaptive Data Replication Algorithm," ACM Transactions on Database Systems, vol. 22, no. 2 (1997), pp. 255-314.

2. Budiarto, S. Nishio, and M. Tsukamoto, "Data Management Issues in Mobile and Peer-to-Peer Environments," Data and Knowledge Engineering, Elsevier, 41 (2002), pp. 183-204.

3. K. Ranganathan, A. Iamnitchi, and I. Foster, "Improving Data Availability through Dynamic Model-Driven Replication in Large Peer-to-Peer Communities," Global and P2P Computing on Large Scale Distributed Systems Workshop, Berlin, May 2002, pp. 1-6.

4. R.H. Thomas, "A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases," ACM Transactions on Database Systems, vol. 4, no. 2 (1979), pp. 180-209.

5. D.K. Gifford, "Weighted Voting for Replicated Data," Proc. 7th Symp. on Operating System Principles, Dec. 1979, pp. 150-162.

6. D. Agrawal and A. El Abbadi, "Using Reconfiguration for Efficient Management of Replicated Data," IEEE Trans. on Knowledge and Data Engineering, vol. 8, no. 5 (1996), pp. 786-801.

7. S.M. Chung, "Enhanced Tree Quorum Algorithm for Replicated Distributed Databases," Int'l Conference on Databases, Tokyo, 1990, pp. 83-89.

8. M.T. Ozsu and P. Valduriez, Principles of Distributed Database Systems, 2nd Ed., Prentice Hall, 1999.

9. J.F. Paris and D.E. Long, "Efficient Dynamic Voting Algorithms," Proc. Fourth IEEE Int'l Conf. Data Eng., Feb. 1988, pp. 268-275.

10. D. Agrawal and A. El Abbadi, "The Tree Quorum Technique: An Efficient Approach for Managing Replicated Data," Proc. 16th Int'l Conf. on Very Large Data Bases, Aug. 1990, pp. 243-254.

11. D. Agrawal and A. El Abbadi, "The Generalized Tree Quorum Technique: An Efficient Approach for Managing Replicated Data," ACM Trans. Database Systems, vol. 17, no. 4 (1992), pp. 689-717.

12. S. Jajodia and D. Mutchler, "Dynamic Voting Algorithms for Maintaining the Consistency of a Replicated Database," ACM Trans. Database Systems, vol. 15, no. 2 (1990), pp. 230-280.

13. M. Maekawa, "A √N Algorithm for Mutual Exclusion in Decentralized Systems," ACM Trans. Computer Systems, vol. 3, no. 2 (1985), pp. 145-159.

14. S.Y. Cheung, M.H. Ammar, and M. Ahamad, "The Grid Protocol: A High Performance Scheme for Maintaining Replicated Data," IEEE Trans. Knowledge and Data Engineering, vol. 4, no. 6 (1992), pp. 582-592.

15. B. Bhargava, "Concurrency Control in Database Systems," IEEE Trans. Knowledge and Data Engineering, vol. 11, no. 1 (1999), pp. 3-16.

16. P.A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.

17. P.A. Bernstein and N. Goodman, "An Algorithm for Concurrency Control and Recovery in Replicated Distributed Databases," ACM Trans. Database Systems, vol. 9, no. 4 (1984), pp. 596-615.
MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-enabled MPI Processes
Namyoon Woo and Heon Y. Yeom
School of Computer Science and Engineering
Seoul National University
Seoul, 151-742, KOREA
nywoo,[email protected]
Taesoon Park
Department of Computer Engineering
Sejong University
Seoul, 143-747, KOREA
[email protected]
Abstract
Fault-tolerance is an essential element of distributed systems that require a reliable computation environment. In spite of extensive research over two decades, practical fault-tolerance systems have not been provided, due to the high overhead and the unwieldiness of previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and practice of fault-tolerance systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant MPICH version, we have designed task migration, dynamic process management for MPI, and message queue management. MPICH-GF requires no modification of application source code and affects MPICH communication as little as possible. The features of MPICH-GF are that it supports the direct message transfer mode and that all of the implementation has been done at the lower layer, that is, at the virtual device level. We have evaluated MPICH-GF with NPB applications on the Globus middleware.
1 Introduction
A ‘computational grid’ (shortly, a grid) is a specialized instance of a distributed system, which includes a heterogeneous collection of computers in different domains connected by networks [17, 18]. The grid has attracted considerable attention for its ability to utilize ubiquitous computational resources under a single system image, and it has been believed to benefit many computation-intensive parallel applications. Recently, Argonne National Laboratory has proposed the ‘Globus Toolkit’ as the framework of the grid, and it has been taken as the de-facto standard of grid services [16]. The main features of the Globus Toolkit are global resource allocation and management, directory service, remote task running, user authentication, and so on. Although the Globus Toolkit can monitor and manage global resources, it lacks dynamic process management such as fault-tolerance or dynamic load balancing, which is essential to distributed systems.
Distributed systems are not reliable enough to guarantee the completion of parallel processes in a determinate time because of their inherent failure factors. The system consists of a number of nodes, disks and network lines, which are all exposed to failures. Even a single local failure can be fatal to the parallel processes, since it nullifies all of the computation results which have been produced in cooperation with one another. Assuming that there is a one percent chance that a single machine might crash during the execution of a parallel application, there is only about a 37 percent chance (0.99^100) that a whole system with one hundred machines would be alive throughout the
application run. In order to increase the reliability of distributed systems, it is important to provide them with fault-tolerance.
Checkpointing and rollback-recovery is a well-known technique for fault-tolerance. Checkpointing is an operation that stores the states of processes into stable storage for the purpose of recovery or migration [12]. A process can resume its previous state at any time with the latest checkpoint file, and periodic checkpointing can minimize the computation loss incurred by a failure. Although several stand-alone checkpoint toolkits [23, 31, 38] have been implemented, they are not sufficient for parallel computing for the following reasons: First, they cannot restore communication states such as sockets or shared memory. Second, they do not consider the causal relationship among the states of processes. Process states may be dependent on one another in the message-passing environment [9, 27]. Hence, independent process recovery without consideration of dependency may tangle the global process states. The states of inter-process relations should be reconstructed for consistent recovery. Consistent recovery for message-passing processes has been extensively studied for over two decades, and several distributed algorithms for consistent recovery have been proposed. However, implementation of the algorithms is another issue, because they assume the following:
• A process detects a failure and revives by itself.
• Both revived and survived processes can communicate after recovery without any procedure of channel reconstruction.
• Checkpointed binary files are always available.
Also, there have been many efforts to make the theory practical [19, 28, 10, 6, 13, 24, 34, 22, 36, 4, 33, 30]. Their approaches take different strategies in the following respects: system-level versus application-level checkpointing, kernel-level versus user-level checkpointing, user-transparency, support of non-blocking message transfer, and direct versus indirect communication methods. According to their strategies, the implementation levels differ. However, the frameworks remain proofs of concept or evaluation tools for the recovery algorithms. To the best of our knowledge, there are only a few systems which are actually used in practice [19, 11], and they are only valid for a specific parallel programming model or a specific machine.
Our goal in this paper is to construct a practical fault-tolerance system for message-passing applications on grids. We have integrated a rollback-recovery algorithm with the Message Passing Interface (MPI) [14], the de-facto standard of message-passing programming. As a research result, we present our MPICH-GF, which is based on MPICH-G2 [15], the grid-enabled MPI implementation. MPICH-GF is completely transparent to application developers, and it requires no modification of application source code. No in-transit message during checkpointing is ever lost, whether it belongs to a blocking or a non-blocking operation. We have extended the Globus job manager module to support rollback-recovery and to control distributed processes across domains. Above all, our main implementation issue is to provide dynamic process management, which is not defined in the original MPI standard (version 1). Since the original MPI standard specifies only static process group management, a revived process, regarded as a new instance, cannot rejoin the process group. In order to enable a new MPI instance to communicate with the running processes, we have implemented an ‘MPI_Rejoin()’ function. Currently, a coordinated checkpointing protocol has been implemented and tested with MPICH-GF, and other consistent recovery algorithms (e.g. message logging) are under development. MPICH-GF operates on Linux kernel 2.4 with Globus Toolkit 2.2.
The rest of this paper is organized as follows: In Section 2, we present the concept of consistent recovery and related work on fault-tolerance systems. Section 3 describes the operation of the original MPICH-G2. We propose the MPICH-GF architecture in Section 4 and address implementation issues in Section 5. The experimental results of MPICH-GF with the NAS Parallel Benchmarks are shown in Section 6, and finally the conclusions are presented in Section 7.
2 Background
2.1 Consistent Recovery
In the message-passing environment, the states of processes may have dependency relations with one another through message-receipt events. A consistent system state is one in which, for every message-receipt event reflected in the system state, the corresponding sending event is also reflected [9]. If a process rolls back to a past state but another process whose current state is dependent on the lost state does not roll back, inconsistency occurs.
Figure 1. Inconsistent global checkpoint
Figure 1 shows a simple example of two processes whose local checkpoints do not form a consistent global checkpoint. Suppose that the two processes roll back to their latest local checkpoints. Then one process has not yet sent a message that the other has already marked as received; such a message becomes an orphan message, and it causes inconsistency. Conversely, a message that the sender has marked as already sent, but whose receiver still waits for its arrival, is considered a lost message. If an in-transit message is not recorded during checkpointing, it becomes a lost message. Both of these message types cause abnormal execution of processes during recovery.
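The orphan-message condition can be stated as a small executable check. This toy model (not from the paper) timestamps each send and receipt on the local logical clock of the respective process:

```python
def is_consistent(checkpoints, messages):
    """A global checkpoint is inconsistent if it reflects a receipt whose
    send is not reflected (an orphan message).  `checkpoints[p]` is the
    local logical time of p's checkpoint; each message is a tuple
    (sender, send_time, receiver, recv_time)."""
    for sender, sent, receiver, received in messages:
        send_reflected = sent <= checkpoints[sender]
        recv_reflected = received <= checkpoints[receiver]
        if recv_reflected and not send_reflected:
            return False   # orphan: receipt recorded, send rolled back
    return True

# A message received before P2's checkpoint but sent after P1's -> orphan
ckpts = {"P1": 5, "P2": 5}
msgs  = [("P1", 7, "P2", 4)]
print(is_consistent(ckpts, msgs))   # False
```

A lost message (send reflected, receipt not) does not make the state inconsistent by this definition, but it must be replayed during recovery, which is why in-transit messages have to be logged or flushed.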
Extensive research on consistent recovery has been conducted [12]. Approaches to consistent recovery can be categorized into coordinated checkpointing, communication-induced checkpointing, and message logging. In coordinated checkpointing, processes synchronize their local checkpointing so that a consistent set of global checkpoints can be guaranteed [9, 21]. On failure, all the processes roll back to the latest global checkpoints for consistent recovery. However, it is a dominant idea that coordination would not scale up. Communication-induced checkpointing (CIC) allows processes to checkpoint independently while preventing the domino effect using information piggy-backed on messages [2, 12, 37]. In [2], Alvisi et al. disputed the belief that CIC would be scalable. According to their experimental report, CIC generates enormous numbers of forced checkpoints, and the process autonomy in placing local checkpoints does not seem to pay off in practice. Message logging records messages along with checkpoint files in order to replay them during recovery. It can be further classified into pessimistic, optimistic, and causal message logging according to the policy on how message logs are stored [3]. Log-based rollback-recovery minimizes the amount of lost computation at the price of high storage overhead.
2.2 Related Works
Fault-tolerant systems for message-passing processes can be categorized by whether they support direct or indirect message transfer. With direct message transfer, application processes communicate with other processes directly, that is, client-to-client communication is possible. The MPICH implementation is the representative case of the direct transfer mode. With indirect message transfer, messages are sent through a medium such as a daemon: for example, PVM and LAM-MPI [7] are such cases. In these systems, processes do not have to know the physical
address of another process. Instead, they maintain a connection with the medium, and hence the recovery of the communication context is relatively easy. However, an increase in message delay is inevitable.
CoCheck [36] is a coordinated checkpointing system for PVM and tuMPI. While CoCheck for tuMPI supports the direct transfer mode, the PVM version exploits the PVM daemon to transfer messages. CoCheck exists as a thin library layered over PVM (or MPI) that wraps the original API. The process control messages are implemented at the same level as application messages. Unless an application process calls the recv() function explicitly, it cannot receive any control messages. MPICH-V [6] is a fault-tolerant MPICH version that supports pessimistic message logging. Every message is transferred to a remote Channel Memory server (CM) that logs and replays messages. CMs are assumed to be stable, so that a revived process can recover simply by reconnecting to the CMs. According to their literature, the main reasons for using CMs are to cope with volatile nodes and to keep log data safe, even though the system pays twice the cost for message delivery. FT-MPI [13], proposed by Fagg and Dongarra, supports MPI-2's dynamic task management. FT-MPI has been built on the PVM or HARNESS core library, which exploits daemon processes. Li and Tsay also proposed a LAM-MPI based implementation where messages are transferred via a multicast server on each node [22]. MPI-FT, proposed by Batchu et al. [4], adopts task redundancy to provide fault-tolerance. This system has a central coordinator that relays messages to all the redundant processes. MPI-FT from Cyprus University [24] adopts message logging. An observer process copies all the messages and reproduces them for recovery. MPI-FT pre-spawns processes at the beginning so that one of the spare processes can take over a failed process. This system has a high overhead for the storage of all the messages.
There are also some research results supporting the direct message transfer mode. Starfish [1] is a heterogeneous checkpointing toolkit based on a virtual machine, which makes it possible for processes to migrate among heterogeneous platforms. The limits of this system are that applications have to be written in OCaml and that byte code runs more slowly than native code. Egida [33] is an object-oriented toolkit that provides both communication-induced checkpointing and message logging for MPICH with the ch_p4 device. An event handler hijacks MPI operation events in order to perform the corresponding pre-defined actions for rollback-recovery. To support the atomicity of message transfer, it substitutes blocking operations for non-blocking ones. The master process with rank 0 is responsible for updating the communication channel information on recovery; therefore, the master process is not supposed to fail. The current version of Egida can detect only process failures, and a failed process is recovered on the same node where it previously ran. As a result, Egida cannot manage hardware failures. Hector [34] is similar to our MPICH-GF. Hector exists as a movable MPI library, MPI-TM, and several executables. Hierarchical process managers create and migrate application processes, and coordinated checkpointing has been implemented. Before checkpointing, every process closes its channel connection to ensure that no in-transit message is left in the network. Processes have to reconnect the channel after checkpointing.
One technique to prevent in-transit messages from being lost is to put off checkpointing until all the in-transit messages are delivered. In CoCheck, processes exchange ready messages (RMs) with one another for the purpose of coordination and to guarantee the absence of in-transit messages [36]. RMs are sent through the same channel as the application messages. Since the channels are based on TCP sockets, which satisfy the FIFO property, an RM's arrival guarantees that all the messages issued earlier than the RM have arrived at the receiver. In Legion MPI-FT [30], each process reports the number of messages sent and received to the coordinator at the checkpointing request. The coordinator calculates the number of in-transit messages and then waits until the processes report that all the in-transit messages have arrived.
The other way to prevent the loss of in-transit messages is to build a user-level reliable communication protocol in which in-transit messages are recorded in the sender's checkpoint file as undelivered. Meth et al. named this technique 'stop and discard' in [25]. Messages received during checkpointing are discarded. After checkpointing, discarded messages are re-sent by the user-level reliable communication protocol. This mechanism requires one more memory copy at the sender side. RENEW [28] is a recoverable runtime system proposed by Neves et al. It also has user-level reliable communication layers on top of the UDP transport protocol in order to log messages and to prevent the loss of in-transit messages.

Some research uses application-level checkpointing for migration among heterogeneous nodes or for minimization of the checkpoint file size [5, 29, 30, 32, 35]. This type of checkpointing burdens application developers with deciding when to checkpoint, what to store in the checkpoint file, and how to recover with the stored information. The CLIP toolkit [10] for the Intel Paragon provides a semi-transparent checkpointing environment for users. The user must perform minor code modifications to define the checkpointing locations. Although the system-level checkpoint file is not itself portable across heterogeneous platforms, we believe that user transparency is an important virtue, since it is unlikely that a process must recover on a heterogeneous node when plenty of homogeneous nodes exist. In addition, application developers are not willing to accept such a programming effort.
To sum up, most of these systems have been implemented using the indirect communication mode, or they are valid only for specialized platforms.
3 MPICH-G2
In this section, we describe the execution of MPI processes on Globus middleware and the communication mechanism of MPICH-G2. The Message Passing Interface (MPI) is the de facto standard specification for message-passing programming that abstracts low-level message-passing primitives away from the developer [14]. Among several MPI implementations, MPICH [20] is the most popular for its good performance and portability. The good portability of MPICH can be attributed to its abstraction of low-level operations, the Abstract Device Interface (ADI), as shown in Figure 2. An implementation of the ADI is called a virtual device. MPICH version 1.2.3 includes about 15 virtual devices. In particular, MPICH with the grid device globus2 is called MPICH-G2 [15]. For reference, our MPICH-GF has been implemented as the unmodified upper MPICH layer plus our own virtual device 'ft-globus', derived from globus2.
Figure 2. Layers of MPICH: collective operations (Bcast(), Barrier(), Allreduce(), ...) and point-to-point operations (Send()/Recv(), Isend()/Irecv(), Waitall(), ...) form the MPI implementation, layered over the ADI and its virtual devices (ch_shmem, ch_p4, globus2, ft-globus, ...).
3.1 Globus Run-time Module
The Globus toolkit [16], proposed by ANL, is the de facto standard grid middleware. It contains directory service, resource monitoring/allocation, data sharing, authentication, and authorization. However, it lacks dynamic run-time process control, for example dynamic load balancing or fault tolerance.
Figure 3 describes how MPI processes are launched on the Grid middleware. Three main Globus modules concern process execution: DUROC (Dynamically Updated Request Online Co-allocator), GRAM (Globus Resource Allocation Management) job managers, and a gatekeeper. DUROC distributes a user request to the local GRAM modules, and a gatekeeper on each node then checks whether the user is authenticated. If (s)he is an authenticated user, the gatekeeper launches a GRAM job manager. Finally, the GRAM job manager forks
Figure 3. The procedure of process launching in Globus: DUROC (central manager) contacts a gatekeeper on each node, which launches a GRAM job manager (local manager); the job manager fork()s the MPI application processes.
and executes the requested processes. Neither DUROC nor the GRAM job manager controls processes dynamically; they just monitor them. Nevertheless, Globus has the fundamental framework of hierarchical process management. For dynamic process management, we have extended the capabilities of DUROC and the GRAM job manager by modifying their source code. We present their extended abilities in Section 4; in this context, we use the terms 'central manager' and 'local manager' instead of DUROC and GRAM job manager, respectively, for convenience.
3.2 globus2 Virtual Device
Collective communication in MPICH-G2 is implemented as a combination of point-to-point (P2P for short) communications based on non-blocking TCP sockets and active polling. MPICH has P2P communication primitives with the following semantics:
Blocking operations: MPI_Send(), MPI_Recv()

Non-blocking operations: MPI_Isend(), MPI_Irecv()

Polling: MPI_Wait(), MPI_Waitall()
The blocking send operation submits a send request to the kernel and waits until the kernel copies the message into kernel-level memory. The blocking receive operation likewise waits until the kernel delivers the requested message to user-level memory, whereas a non-blocking operation only registers its request and returns. Actual message delivery for a non-blocking operation is not performed until a polling function is called. Indeed, in MPICH-GF a blocking operation is the combination of the corresponding non-blocking operation and the polling function.
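The relationship above can be illustrated with a minimal, MPI-free model (all names here are invented for illustration, not MPICH internals): a non-blocking call merely registers a request, the polling function performs the actual delivery, and a blocking call is simply the two combined.

```python
# Toy model of MPICH-GF's operation semantics: isend() registers a request,
# wait() performs the actual delivery, send() = isend() + wait().

class Request:
    def __init__(self, payload, mailbox):
        self.payload = payload
        self.mailbox = mailbox      # stands in for the kernel-level buffer
        self.done = False

def isend(payload, mailbox):
    """Non-blocking send: register the request and return immediately."""
    return Request(payload, mailbox)

def wait(request):
    """Polling function: perform the actual delivery of a registered request."""
    if not request.done:
        request.mailbox.append(request.payload)
        request.done = True

def send(payload, mailbox):
    """Blocking send = non-blocking send + polling, as in MPICH-GF."""
    wait(isend(payload, mailbox))

mailbox = []
r = isend("hello", mailbox)
assert mailbox == []            # nothing delivered until polled
wait(r)
send("world", mailbox)
assert mailbox == ["hello", "world"]
```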
The communication mechanism of the globus2 device is similar to that of the ch_p4 device [8]. Each MPI process opens a 'listener port' in order to accept requests for channel opening. On receipt of a request, the receiver opens another socket and constructs a channel between the two processes. All the listener information is transferred to every process at MPI initialization: the master process with rank 0 collects the listener information of the others and broadcasts it. However, globus2 does not fork a separate listener process as ch_p4 does. A channel opening request is not accepted until the receiver explicitly receives the request by calling the polling function.
Figure 4 shows the structure of the process group table commworldchannel of the globus2 device. The i-th entry in commworldchannel contains the information of the channel to the process with rank i. In Figure 4,
Figure 4. The commworldchannel structure: each channel entry (tcp_miproto_t) holds the hostname and listener port of the peer, a handlep for the open channel, and a send queue (send_q_head/send_q_tail) of tcpsendreq entries carrying the buffer, source rank, destination rank, tag, and datatype.
we abstract this structure to show only the values of our concern. The pair of hostname and listener port is the address of a listener. The handlep field contains the real channel information; if handlep is null, the channel has not been opened yet. A send operation pushes a request into the send queue and registers the request with the globus_io module. The polling function is then called to wait until the kernel handles the request.
There are two receive queues in MPICH: the unexpected queue and the posted queue (Figure 5). The former contains arrived messages whose receive requests have not been issued yet. When a receive operation is called, it first examines the unexpected queue to see whether the message has already arrived. If the corresponding message exists in the unexpected queue, it is delivered; otherwise, the receive request is enqueued into the posted queue. On message arrival, the handler checks whether there is a corresponding request in the posted queue. If one exists, the message is copied into the requested buffer; otherwise, the message is pushed into the unexpected queue.
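The matching logic just described can be sketched as follows (a simplified model with assumed field names, not the actual MPICH data structures):

```python
# Two-queue receive mechanism: arriving messages check the posted queue,
# receive calls check the unexpected queue first.

unexpected_q = []   # messages that arrived before a matching request
posted_q = []       # receive requests waiting for a matching message

def matches(request, message):
    return request["source"] == message["source"] and request["tag"] == message["tag"]

def on_message_arrival(message):
    for req in posted_q:
        if matches(req, message):
            posted_q.remove(req)
            req["buffer"].append(message["data"])   # deliver into requested buffer
            return
    unexpected_q.append(message)                    # no request yet: keep it

def recv(source, tag, buffer):
    req = {"source": source, "tag": tag, "buffer": buffer}
    for msg in unexpected_q:
        if matches(req, msg):
            unexpected_q.remove(msg)
            buffer.append(msg["data"])              # already arrived: deliver now
            return
    posted_q.append(req)                            # not yet arrived: post request

buf = []
on_message_arrival({"source": 1, "tag": 7, "data": "early"})  # goes to unexpected queue
recv(1, 7, buf)                                               # matches immediately
assert buf == ["early"]
recv(2, 9, buf)                                               # posted, nothing arrived yet
on_message_arrival({"source": 2, "tag": 9, "data": "late"})
assert buf == ["early", "late"]
```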
Figure 5. Receive queues of MPICH: the posted queue holds Rhandle entries (source rank, tag, context id, buffer, datatype) for issued receive requests, while the unexpected queue holds messages that arrived before a matching request.
4 MPICH-GF
In the MPICH-GF system, MPI processes run under the control of hierarchical managers, a central manager andlocal managers. The central manager and the local managers are responsible for hardware or network failures andprocess failures, respectively. They are also responsible for checkpointing and automatic recovery. In this section,we present MPICH-GF’s structure and checkpointing/recovery protocol implementation.
Figure 6. The structure of MPICH-GF: the ft-globus device, derived from the original globus2 device, adds dynamic process management, a checkpoint toolkit, and message queue management beneath the unmodified MPI implementation and ADI layers (collective and P2P communication).
4.1 Structure
Figure 6 describes the MPICH-GF library structure. The fault-tolerance module has been implemented at the virtual device level, in 'ft-globus', and MPICH-GF requires no modification of the upper MPI implementation layer. The ft-globus device contains dynamic process group management, a checkpoint toolkit, and message queue management. Most previous works implemented the fault-tolerance module at the upper layer, abstracting away the communication primitives, while our implementation has been performed at the lower layer. The upper-layer approaches neglect the characteristics of some specific communication operations, for example the non-blocking operations. In addition, the low-level approach is necessary to reconstruct the physical communication channel.
4.2 Coordinated Checkpointing
For consistent recovery, the coordinated checkpointing protocol is employed. The central manager initiates global checkpointing periodically, as shown in Figure 7. Local managers then signal processes with SIGUSR1 so that the processes can prepare for checkpointing. The signal handler for checkpointing is registered in the MPI process during MPI initialization. On receipt of SIGUSR1, the signal handler executes a barrier-like function before checkpointing. Performing this barrier guarantees two things. One is that there is no orphan message between any two processes. The other is that there is no in-transit message, because the barrier messages push any previously issued message into the receiver. The channel implementation of globus2 is based on TCP sockets, and hence the FIFO property holds. Pushed messages are stored in the receiver's queue in user-level memory so that the checkpoint file can include them. When processes are recovered with this global checkpoint, all in-transit messages are also restored. This technique is similar to the Ready Message of CoCheck. After the quasi-barrier, each process generates its checkpoint file and informs the local manager that checkpointing completed successfully. The central manager checks that all the checkpoint files have been generated and confirms the collection of checkpoint files as a new version of the global checkpoint.
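The coordination flow of Figure 7 can be sketched, much simplified, as follows (class and method names are invented; the real MPICH-GF managers are separate processes communicating over the network):

```python
# Toy coordinator: ask every process to checkpoint, wait for n "completed"
# replies, then confirm the set of checkpoint files as a new global version.

class Process:
    def __init__(self, rank):
        self.rank = rank
        self.checkpoints = []

    def on_sigusr1(self, epoch):
        # the barrier-like step would run here to flush in-transit messages
        self.checkpoints.append(epoch)      # write a checkpoint file for this epoch
        return "completed"

class CentralManager:
    def __init__(self, processes):
        self.processes = processes
        self.global_version = 0

    def initiate_checkpoint(self):
        epoch = self.global_version + 1
        replies = [p.on_sigusr1(epoch) for p in self.processes]   # via local managers
        if all(r == "completed" for r in replies):                # wait for n replies
            self.global_version = epoch                           # confirm new version
        return self.global_version

procs = [Process(r) for r in range(4)]
cm = CentralManager(procs)
assert cm.initiate_checkpoint() == 1
assert cm.initiate_checkpoint() == 2
assert all(p.checkpoints == [1, 2] for p in procs)
```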
4.3 Consistent Recovery
In the MPICH-GF system, the hierarchical managers are responsible for failure detection and automatic recovery. We have implemented the hierarchical managers by modifying the original Globus run-time modules as described in Section 3. Since application processes are forked by the local manager, the local manager receives SIGCHLD when
Figure 7. Coordinated checkpointing protocol: the central manager initiates checkpointing and requests it from each local manager, the local manager signals its process with SIGUSR1, each process performs the barrier and then checkpoints, and the central manager waits for n 'checkpoint completed' replies before confirming.
its forked application process terminates. Upon receiving the signal, the local manager checks whether the termination was through a normal exit() by calling the waitpid() system call. If so, the execution completed successfully and the local manager does not have to do anything. Otherwise, the local manager regards it as a process failure and notifies the central manager. However, it is possible for a local manager to fail as well. The central manager monitors all the local managers by pinging them periodically. If a local manager does not answer, the central manager assumes that the local manager, the hardware, or the network has failed. It then re-submits the request to the GRAM module in order to restore the failed processes from the checkpoint files.
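The local manager's check can be sketched with POSIX primitives (a simplified, single-child model; the names are ours):

```python
# After a child terminates, waitpid() plus the WIF* predicates distinguish a
# normal exit() from a termination that should be reported as a process failure.

import os, signal

def run_child(action):
    pid = os.fork()
    if pid == 0:                      # child: the "application process"
        action()
        os._exit(0)                   # not reached if action kills the process
    _, status = os.waitpid(pid, 0)    # the local manager reaps the child
    if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
        return "ok"                   # normal exit: nothing to do
    return "process failure"          # abnormal: notify the central manager

assert run_child(lambda: None) == "ok"
assert run_child(lambda: os.kill(os.getpid(), signal.SIGKILL)) == "process failure"
```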
In coordinated checkpointing, a single failure results in the rollback of all processes to the consistent global checkpoint. The central manager broadcasts both the failure event and a rollback request. Our first approach to recovery was to kill all the processes on a single failure and to recreate them by submitting sub-requests to the gatekeeper on each node. This approach took too much time for the gatekeeper's authentication and the reconstruction of all the channels. To improve the efficiency of recovery, we do not terminate the surviving processes. Instead, they load the checkpoint files into their memory by calling exec(). This mechanism affects only the user-level memory, without affecting the channel status remaining on the kernel side. After reloading their memory, the surviving processes can communicate without any channel reconstruction. However, the channel to the failed process must be reconstructed.
5 Implementation Issues
5.1 Communication Channel Reconstruction
The original MPI specification supports only static process group management. Once a process group and its channels are built, this information cannot be altered at run time; in other words, no new process can join the group. A recovered process instance is regarded as a new instance from the group's point of view: although it can restore the process state, it cannot communicate with the surviving processes. Our proposed solution is to invalidate the communication channels of the failed process and to reconstruct them. We have implemented a new 'MPI_Rejoin()' function for this purpose.
Before a restored process resumes computation, it calls MPI_Rejoin() in order to update its listener information in the others' commworldchannel tables and to re-initialize its own channel information. To update the listener port, the restored process reports the following values:
Figure 8. Communication channel reconstruction protocol: (1) the recovered process informs its local manager of the new listener information, (2) the local manager forwards it to the central manager, (3) the central manager broadcasts it, and (4) the survived processes update their commworldchannel information.
MPI_function():
    ...
    requested_for_coordination = FALSE;
    ft_globus_mutex = 1;
    ...                                      /* message transfer in progress */
    if (requested_for_coordination == TRUE) then
        do_checkpoint();
    endif
    ft_globus_mutex = 0;

signal_handler():
    if (ft_globus_mutex == 1) then
        requested_for_coordination = TRUE;   /* defer until transfer completes */
    else
        do_checkpointing();
    return;

Figure 9. Atomicity of message transfer
global rank: the logical process ID of the previous run.
hostname: the address of the current node where the process is restored.
port number: the new listener port number.
This information is broadcast via the hierarchical managers. Each surviving process either invalidates its channel to the failed process or renews the listener information of the restored process, according to the event information from its local manager. Figure 8 describes the interaction among managers and processes for the MPI_Rejoin() call. The restored process frees its handles and re-initializes its channels, so that it behaves as if it had not yet created any channel to the others. The procedure of channel reconstruction is then the same as that of channel creation (as described in Section 3.2).
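The effect of MPI_Rejoin() on the survivors' tables can be sketched as follows (a toy model with invented structures, not the real commworldchannel layout):

```python
# The restored process announces (rank, hostname, port); survivors record the
# new listener address and invalidate the stale handle, so the next send
# re-opens the channel exactly as an initial channel creation would.

def make_table(listeners):
    # commworldchannel-like table: rank -> {addr, handle}
    return {r: {"addr": a, "handle": f"ch{r}"} for r, a in listeners.items()}

def rejoin(tables, rank, hostname, port):
    """Broadcast the new listener info of a restored process to every table."""
    for table in tables:
        entry = table[rank]
        entry["addr"] = (hostname, port)   # renew listener information
        entry["handle"] = None             # invalidate: channel must be rebuilt

t0 = make_table({0: ("n0", 5000), 1: ("n1", 5001)})
t1 = make_table({0: ("n0", 5000), 1: ("n1", 5001)})
rejoin([t0, t1], 1, "n2", 6000)            # rank 1 restored on node n2
assert t0[1] == {"addr": ("n2", 6000), "handle": None}
assert t1[1]["handle"] is None             # next send re-opens the channel
```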
5.2 Atomicity of Message Transfer
Messages in MPICH-G2 are sent divided into a header and a payload. If checkpointing is performed after the header is received but before the payload is received, partial message loss may happen. We provide atomicity of message transfer in order to store and restore the communication context safely: checkpointing is not performed while a message transfer is in progress. We have made the MPI communication operations mutually exclusive with the checkpoint signal handler, as shown in Figure 9.
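The deferral pattern of Figure 9 can be reproduced with ordinary Unix signals (a runnable toy, with our own names; MPICH-GF implements this in C inside the ft-globus device):

```python
# If the checkpoint signal arrives inside a communication critical section,
# the handler only sets a flag; the checkpoint runs after the transfer ends.

import os, signal

checkpoints = []
state = {"mutex": 0, "requested": False}

def do_checkpointing():
    checkpoints.append("ckpt")

def handler(signum, frame):
    if state["mutex"] == 1:
        state["requested"] = True     # defer: a transfer is in progress
    else:
        do_checkpointing()            # safe to checkpoint immediately

signal.signal(signal.SIGUSR1, handler)

def mpi_send_like():
    state["requested"] = False
    state["mutex"] = 1
    os.kill(os.getpid(), signal.SIGUSR1)   # signal arrives mid-transfer
    # ... message transfer completes here ...
    if state["requested"]:
        do_checkpointing()                 # run the deferred checkpoint
    state["mutex"] = 0

mpi_send_like()
assert checkpoints == ["ckpt"]             # deferred, then performed once
os.kill(os.getpid(), signal.SIGUSR1)       # outside the critical section
assert checkpoints == ["ckpt", "ckpt"]     # handled immediately
```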
The mutually exclusive regions for the send and receive operations have been implemented at different levels. We set the whole send operation (MPI_Send and MPI_Isend) as a critical section. The process status in a checkpoint file should be either that no send operation has been called or that the message has been sent completely. We do not want a checkpoint file to contain send requests in the send queue, because each send request is tied to a physical channel: if a process restored its state with send-queue entries, some of them might not be sent correctly because of channel updates. The kernel does not process a non-blocking send request until a polling function such as MPI_Wait() is called explicitly in the user code. MPICH-GF therefore replaces each non-blocking send operation with a blocking one to ensure that no send request exists in a checkpoint file. This replacement does not affect the performance or correctness of the process, because the blocking operation just submits its request to the kernel; it does not wait for the delivery of the message at the receiver side.
The critical section of the receive operation is implemented at a lower level than that of the send operation. If the whole receive operation were a critical section, a deadlock could occur during coordinated checkpointing (Figure 10): a sender waits for all the processes to enter the coordination procedure while a receiver waits for the arrival of the requested message. At the receiver side, the coordination signal is delayed until the receive call finishes. The blocking receive is a combination of the non-blocking receive and a loop over the polling function, and the actual message delivery from the kernel to user memory is done by the polling. To prevent the deadlock, we set only the polling function inside the loop as a critical section. The checkpoint file may then contain receive requests in the receive queue, which does not matter because they are not related to physical channel information. To match an arrived message with a receive request, MPI's receive module checks only the sender's rank and the message tag, so a recovered process can still receive the messages corresponding to the restored receive requests. For that reason, the non-blocking receive does not need to be replaced with a blocking one.
Figure 10. Deadlock of blocking operations in coordinated checkpointing: one process blocks in Send() waiting for coordination while its peer blocks in Recv() waiting for the message, so the peer's coordination request is delayed.
6 Experimental Results
In this section, we present performance results for five NAS Parallel Benchmark applications [26] (LU, BT, CG, IS, and MG) executing on a cluster-based grid system of four Intel Pentium III 800 MHz PCs with 256 MB of memory, connected by ordinary 100 Mbps Ethernet. Linux kernel version 2.4 and Globus toolkit version 2.2 were installed. We measured the overhead of checkpointing and the cost of recovery.
Table 1 shows the characteristics of the applications. IS uses all-to-all collective operations only. MG and LU use anonymous receive operations with the MPI_ANY_SOURCE option; however, the corresponding sender for each receive call is determined statically by message tagging. LU is the most communication-intensive application, yet its message size is the smallest. Each process of CG has two or three neighbors and communicates with only one of them per iteration. The message size of CG is relatively large. BT is also a communication-intensive application: each process has six neighbors, in all directions of a 3D cube, and all of them perform non-blocking operations in every iteration.1
Application  Description                Communication  Avg. Message  Sent Messages  Executable      Avg. Checkpoint
(Class)                                 Pattern        Size (KB)     per Process    File Size (KB)  Size (MB)
BT (A)       Navier-Stokes equation     3D mesh        112.3         2440           1018.9          86.8
CG (B)       Conjugate gradient method  Chain          144.9         7990           993.2           125.9
IS (B)       Integer sort               All-to-all     2839.2        106            901.1           154.9
LU (A)       LU decomposition           Mesh           3.8           31526          1054.7          18.1
MG (A)       Multiple grid              Cube           35.4          3046           952.3           124.2

Table 1. Characteristics of the NPB applications used in the experiments
We measured the total execution time of the NPB applications using MPICH-G2 and MPICH-GF respectively. To evaluate the checkpointing overhead, we varied the checkpoint period from 10% to 50% of the total execution time. With a 10% checkpoint period, applications perform checkpointing 9 times, while only one checkpoint is taken with a 50% period. Figure 11(a) shows the total execution times of each case without any failure event. To evaluate the monitoring overhead, we measured the total execution times of MPICH-G2 and MPICH-GF without checkpointing. In this experiment, the central manager queries the local managers for their status every five seconds; if a local manager does not reply, it is assumed to have failed. The difference between the first two cases in Figure 11 indicates the monitoring overhead. The other five cases show the execution time using MPICH-GF with 1, 2, 3, 4, and 9 checkpoints performed, respectively. Figure 11(b) shows the average checkpointing overhead measured at the process. The checkpointing overhead consists of the communication overhead, the barrier overhead, and the disk overhead. We take the communication overhead to be the network delay among the hierarchical managers in a single coordinated checkpoint. We used the O_SYNC option at file open in order to guarantee that blocking disk writes are flushed physically, which is considerably time-consuming. As shown in the figure, the barrier overhead is quite small and the communication overhead is about the same for all the applications. The disk overhead is the dominant overhead for checkpointing and is almost proportional to the checkpoint file size, as expected. The barrier overhead of the IS (class B) application is exceptionally large. This is due to the in-transit messages of IS, whose sizes are the largest among all the applications (Table 1). During the barrier operation, they are pushed into the user-level memory of the receiver. Hence, it takes more time for IS to complete the barrier operation.
Incremental checkpointing and forked checkpointing techniques may be applied to minimize this disk overhead. However, we are not sure that the benefit of incremental checkpointing would be as large as expected, since all the applications used have relatively small executables (about 1 MB) while using a lot of data segment. For example, we measured the checkpoint file size of the IS application with two problem sizes, class A and class B: the checkpoint files are 41.7 MB and 155.8 MB respectively, while the executable remains the same. The difference in checkpoint file size is caused by the heap and the stack, which tend to change frequently.
In order to measure the recovery cost, we measured the time at the central manager from failure detection to the completion of the channel-update broadcast. The recovery cost per single failure is presented in Figure 12. Recovery is performed as follows: failure detection, failure broadcast, authentication of the request, and process re-launching. Among these, authentication is the most dominant factor in recovery, and it is closely related to the workings of the Globus middleware. The second most dominant overhead is process re-launching, and most of that time is spent reading the checkpoint file from disk. We note that neither of these significant overhead factors is related to scalability.
1 Collective operations are implemented by calling non-blocking P2P operations.
Figure 11. Failure-free overhead: (a) total execution time of BT(A), CG(B), IS(B), LU(A), and MG(A) under MPICH-G2, MPICH-GF without checkpointing, and MPICH-GF with checkpoint periods T = 50%, 33%, 25%, 20%, and 10%; (b) composition of the checkpointing overhead (communication, disk, and barrier overhead).
7 Conclusions

In this paper, we have presented the feasibility, architecture, and evaluation of a fault-tolerant system for grid-enabled MPICH. Our proposed system minimizes the loss of computation by periodic checkpointing and guarantees consistent recovery. It requires no modification of application source code or of the upper MPICH layer. While previous research has modified the higher layers, sacrificing performance, MPICH-GF respects the communication characteristics of MPICH through its lower-level approach. Considering the communication context at the lower level makes fine-grained checkpoint timing possible. Moreover, all of our work has been accomplished at the user level. MPICH-GF inter-operates with the Globus toolkit and can restore a failed process on any node across domains, provided the GASS service is available. We have also shown the implementation issues and the evaluation of coordinated checkpointing. At the time of writing, an implementation of independent checkpointing with message logging is under way.

The central manager is exposed as a single point of failure; we plan to use redundancy on the central manager for high availability. Our ultimate target grid system is a pile of clusters, and local failures happening inside a cluster should rather be managed within that cluster. As shown in our experimental results, the recovery cost contains authentication overhead, which is time-consuming. Since the original task request has already been authenticated, the recovery request may skip this procedure. In order to manage such local failures efficiently, a more deeply hierarchical management architecture is required. Currently, we are developing a multiple hierarchical manager system with redundancy.
References
[1] A. Agbaria and R. Friedman. Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations. In Proceedings of the IEEE Symposium on High Performance Distributed Computing, 1999.

[2] L. Alvisi, E. N. Elnozahy, S. Rao, S. A. Husain, and A. D. Mel. An analysis of communication induced checkpointing. In Symposium on Fault-Tolerant Computing, pages 242-249, 1999.
Figure 12. Recovery cost per single failure for BT(A), CG(B), IS(B), LU(A), and MG(A), decomposed into broadcasting the failure event, job re-submission, and process re-launching.
[3] L. Alvisi and K. Marzullo. Message logging: Pessimistic, optimistic, causal and optimal. IEEE Transactions on Software Engineering, 24(2):149-159, FEB 1998.

[4] R. Batchu, A. Skjellum, Z. Cui, M. Beddhu, J. P. Neelamegam, Y. Dandass, and M. Apte. MPI/FT: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing. In 1st International Symposium on Cluster Computing and the Grid, MAY 2001.

[5] A. Beguelin, E. Seligman, and P. Stephan. Application level fault tolerance in heterogeneous networks of workstations. Journal of Parallel and Distributed Computing, 43(2):147-155, 1997.

[6] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. F. Magniette, V. Néri, and A. Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In SuperComputing 2002, 2002.

[7] G. Burns, R. Daoud, and J. Vaigl. LAM: An open cluster environment for MPI. In Proceedings of Supercomputing Symposium, pages 379-386, Toronto, Canada, 1994.

[8] R. Butler and E. L. Lusk. Monitors, messages, and clusters: The p4 parallel programming system. Parallel Computing, 20(4):547-564, 1994.

[9] K. M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computing Systems, 3(1):63-75, AUG 1985.

[10] Y. Chen, K. Li, and J. S. Plank. CLIP: A checkpointing tool for message-passing parallel programs. In Proceedings of SC97: High Performance Networking & Computing, NOV 1997.

[11] I.B.M. Corporation. IBM LoadLeveler: User's guide, SEP 1993.

[12] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375-408, 2002.

[13] G. E. Fagg and J. Dongarra. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In PVM/MPI 2000, pages 346-353, 2000.

[14] M.P.I. Forum. MPI: a message passing interface standard, MAY 1994.

[15] I. Foster and N. T. Karonis. A grid-enabled MPI: Message passing in heterogeneous distributed computing systems. In Proceedings of SC98. ACM Press, 1998.

[16] I. Foster and C. Kesselman. The Globus project: A status report. In Proceedings of the Heterogeneous Computing Workshop, pages 4-18, 1998.

[17] I. Foster and C. Kesselman. The Grid: Blueprint for a Future Computing Infrastructure. Morgan Kaufmann Publishers, 1999.

[18] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. International J. Supercomputer Applications, 15(3), 2001.
[19] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke. Condor-G: A computation management agent for multi-institutional grids. In Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), AUG 2001.

[20] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. MPICH: A high-performance, portable implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6):789-828, 1996.

[21] R. Koo and S. Toueg. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering, SE-13(1):23-31, 1987.

[22] W.-J. Li and J.-J. Tsay. Checkpointing message-passing interface (MPI) parallel programs. In Pacific Rim International Symposium on Fault-Tolerant Systems (PRFTS), 1997.

[23] M. J. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. In USENIX Conference Proceedings, pages 283-290, San Francisco, CA, JAN 1992.

[24] S. Louca, N. Neophytou, A. Lachanas, and P. Evripidou. Portable fault tolerance scheme for MPI. Parallel Processing Letters, 10(4):371-382, 2000.

[25] K. Z. Meth and W. G. Tuel. Parallel checkpoint/restart without message logging. In Proceedings of the 2000 International Workshops on Parallel Processing, 2000.

[26] NASA Ames Research Center. NAS parallel benchmarks. Technical report, http://science.nas.nasa.gov/Software/NPB/, 1997.

[27] R. Netzer and J. Xu. Necessary and sufficient conditions for consistent global snapshots. IEEE Transactions on Parallel and Distributed Systems, 6(2):165-169, 1995.

[28] N. Neves and W. K. Fuchs. RENEW: A tool for fast and efficient implementation of checkpoint protocols. In Symposium on Fault-Tolerant Computing, pages 58-67, 1998.

[29] G. T. Nguyen, V. D. Tran, and M. Kotocová. Application recovery in parallel programming environment. In European PVM/MPI, pages 234-242, 2002.

[30] A. Nguyen-Tuong. Integrating Fault-Tolerance Techniques in Grid Applications. PhD thesis, University of Virginia, USA, 2000.

[31] J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent checkpointing under Unix. In USENIX Winter 1995 Technical Conference, JAN 1995.

[32] J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10):972-986, 1998.

[33] S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Symposium on Fault-Tolerant Computing, pages 48-55, 1999.

[34] S. H. Russ, J. Robinson, B. K. Flachs, and B. Heckel. The Hector distributed run-time environment. IEEE Transactions on Parallel and Distributed Systems, 9(11):1102-1114, NOV 1998.

[35] L. Silva and J. G. Silva. System-level versus user-defined checkpointing. In Symposium on Reliable Distributed Systems 1998, pages 68-74, 1998.

[36] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In Proceedings of the International Parallel Processing Symposium, pages 526-531, APR 1996.

[37] J. Tsai, S.-Y. Kuo, and Y.-M. Wang. Theoretical analysis for communication-induced checkpointing protocols with rollback dependency trackability. IEEE Transactions on Parallel and Distributed Systems, 9(10):963-971, 1998.

[38] V. Zandy, B. Miller, and M. Livny. Process hijacking. In Eighth International Symposium on High Performance Distributed Computing, pages 177-184, AUG 1999.
15
Ω Line Problem in a Consistent Global Recovery Line ∗
†MaengSoon Baik †SungJin Choi ‡JoonMin Gil ‡ChanYeol Park §HyunChang Yoo †ChongSun Hwang
†Dept. of Computer Science & Engineering Korea University, South Korea
(msbak, lotieye and hwang)@disys.korea.ac.kr
‡Supercomputing Center, Korea Institute of Science and Technology Information, South Korea
(jmgil and chan)@kisti.re.kr
§Dept. of Computer Science & Education Korea University, South Korea
Abstract
Optimistic log-based rollback recovery protocols have been regarded as an attractive fault-tolerant solution for distributed systems based on the message-passing paradigm, due to their low overhead during failure-free execution. These protocols are based on the Piecewise Deterministic (PWD) assumption model. They assume, however, that all logged non-deterministic events in a consistent global recovery line can be deterministically replayed at recovery time. In this paper, we present the impossibility of deterministically replaying a logged non-deterministic event in a consistent global recovery line as the Ω Line Problem, which arises from the asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. In addition, we propose a new optimistic log-based rollback recovery protocol that guarantees the deterministic replaying of all logged non-deterministic events belonging to a consistent global recovery line, thereby solving the Ω Line Problem at recovery time.
Keywords : Distributed Systems, Fault-tolerance, Optimistic Log-based Rollback Recovery Protocol, Piecewise
Deterministic Assumption Model, Ω Line Problem.
∗This work was supported by grant No. G-03-GS-01-04X-6 from the Korea Institute of Science and Technology Information
1 Introduction
In distributed systems, long-running distributed application programs run the risk of failing completely if even one of the processes on which they are executing fails [1]. The use of a rollback recovery protocol is an attractive method of providing fault-tolerance, since it can recover the execution of a program after a failure. During failure-free execution, state information about each process is saved on stable storage so that the process can later be restored to an earlier state of its execution [1, 5]. After a failure, if a consistent global system state can be formed from the individual processes' states, the system as a whole can be recovered, and the resulting execution of the system will be equivalent to some possible failure-free execution [2, 4, 12]. Rollback recovery protocols can be made completely transparent to the application program, avoiding special effort by the application programmer and allowing existing programs to be made fault-tolerant easily [2, 18].
Transparent rollback recovery protocols are classified into checkpoint-based rollback recovery protocols and log-based rollback recovery protocols [1, 6]. In a checkpoint-based rollback recovery protocol, each process in the system saves its state information in preparation for the occurrence of a failure; the saved state information is called a checkpoint. Checkpointing methods are classified into independent checkpointing and coordinated checkpointing [2, 9, 11, 13, 16]. Independent checkpointing minimizes failure-free overhead by letting each process take checkpoints independently, without synchronizing with other processes. Coordinated checkpointing, on the other hand, forces synchronization with other processes whenever a process takes a checkpoint. Log-based rollback recovery protocols are usually classified into optimistic and pessimistic log-based rollback recovery protocols [3, 6, 7]. In an optimistic log-based rollback recovery protocol, the determinants (the content and delivery order) of received messages are logged in volatile memory before processing and written to stable storage at a time convenient for the process [4, 6, 8]. A pessimistic log-based rollback recovery protocol, on the other hand, can recover up to the state immediately before a failure by logging the determinants of all nondeterministic events on stable storage before processing [6, 14, 17].
Log-based rollback recovery protocols are popular for building fault-tolerant systems that can tolerate process crash failures, since they guarantee that all logged non-deterministic events in a consistent global recovery line can be deterministically replayed beyond the most recent checkpoint [2, 3]. These protocols require that each process periodically records its local state and logs the messages it received after recording that state. When a process crashes, a new process is created in its place; the new process is given the appropriate recorded local state and is then sent the logged messages in the order they were originally received [5]. Thus, log-based rollback recovery protocols implement the abstraction of a resilient process, in which the crash of a process is translated into intermittent unavailability of that process.
All log-based rollback recovery protocols require that once a crashed process recovers, its state be consistent with the states of the other processes [6]. This consistency requirement is usually expressed in terms of orphan processes, which are surviving processes whose states are inconsistent with the recovered state of a crashed process. Thus, log-based rollback recovery protocols guarantee that no process is an orphan at recovery time.
Previous optimistic log-based rollback recovery protocols assumed that all logged non-deterministic events in a consistent global recovery line can be deterministically replayed. We observe, however, that some logged non-deterministic events in a consistent global recovery line cannot be deterministically replayed at recovery time, because of the asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. As a result, some process may become an orphan process at recovery time even though no process is an orphan in the consistent global recovery line.

In this paper, we present the impossibility of deterministically replaying a logged non-deterministic event in a consistent global recovery line as the Ω Line Problem. In addition, we propose a new optimistic log-based rollback recovery protocol that guarantees the deterministic replaying of all logged non-deterministic events in a consistent global recovery line when a failure occurs, thereby solving the Ω Line Problem at recovery time.
The rest of this paper is organized as follows. Section 2 presents the system model assumed in this paper. Section 3 presents the impossibility of deterministically replaying a logged non-deterministic event in a consistent global recovery line at recovery time. Section 4 describes a new optimistic log-based rollback recovery protocol that guarantees the deterministic replaying of all logged non-deterministic events in a consistent global recovery line at recovery time. Finally, we conclude in Section 5.
2 System Models
The system assumed in this paper is a distributed system consisting of processes [12]. Processes communicate with each other only by exchanging messages. The distributed system is asynchronous: there is no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source [12].
Figure 1. The execution of processes in systems based on PWD model
Processes in the system execute events, including message sending events, message receiving events, and local events. These events are ordered by an irreflexive partial order, the happens-before relation, which represents potential causality [1]. We assume that the channels between processes are reliable and do not reorder messages. Beyond satisfying the happens-before relation, the exact order in which a process executes message receiving events depends on other factors, including process scheduling, routing, and flow control. Thus, after a recovery is successfully completed, a recovering process may not produce the same result as before, even if the same set of messages is sent to it, since the messages may not be redelivered in the same sequence as before the failure. Previous rollback recovery protocols represent this non-determinism by a non-deterministic local event, message delivery, which corresponds to handing a received message to the application. The only constraint is that a process eventually delivers all messages that it receives.
The execution of each process in the system is based on the PWD model. Under the PWD model, the execution of a process is determined by its initial state and by the delivery order and content of its received messages. Thus, if the delivery order and content of received messages are preserved, the execution of the process can be deterministically replayed [2, 4]. In Figure 1, the whole execution of process $p_1$ is determined by its initial state and by the delivery orders and contents of received messages $m_0$, $m_3$ and $m_7$. The formal representation of a process's execution under the PWD model is as follows [2].
$$Ex(p_i) = Init(p_i) + \biguplus_{j \in I} s^j_{p_i}(m_k) \qquad (1)$$

$Ex(p_i)$ : the whole execution state of process $p_i$
$Init(p_i)$ : the initial state of process $p_i$
$+$ : the sequential work flow of a process
$\biguplus_{j \in I}$ : the set of sequential work flows of a process, in ascending order of $j$
$I$ : a state interval number
$s^j_{p_i}(m_k)$ : the $j$-th state interval of process $p_i$, started by message $m_k$
A state interval of a process is started by the delivery of a received message and terminated by the delivery of the next received message. In Figure 1, the state interval $s^1_{p_2}$ of process $p_2$ is started by the delivery of message $m_1$ received from process $p_1$ and terminated by the delivery of message $m_5$ received from process $p_1$. In addition, message sending events within the same state interval are executed deterministically. In Figure 1, the three sending events of messages $m_2$, $m_3$ and $m_7$ are executed deterministically in state interval $s^1_{p_2}$. The whole execution state of a process is a sequential work flow of state intervals. In Figure 1, the whole execution state of process $p_1$ is composed of the initial state $s^0_{p_1}$ and the sequential set of state intervals $s^1_{p_1}$, $s^2_{p_1}$ and $s^3_{p_1}$. The formal representation of the whole execution state of process $p_1$ is as follows.
$$Ex(p_1) = s^0_{p_1} \uplus s^1_{p_1}(m_0) \uplus s^2_{p_1}(m_3) \uplus s^3_{p_1}(m_7) \qquad (2)$$
Processes in systems can fail independently according to the fail-stop model. In this model, processes fail by
halting, and the fact that a process has halted is eventually detectable by all surviving processes.
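The PWD guarantee can be illustrated with a small sketch: replaying a process from its initial state with the same logged delivery order reproduces the same execution, while a different delivery order may not. All names and the state representation below are illustrative assumptions, not the paper's notation.

```python
# Sketch of piecewise-deterministic replay: a process's execution is fully
# determined by its initial state plus the ordered contents of the messages
# delivered to it (the state representation here is an illustrative choice).

def execute(state, message):
    # Deterministic transition: each delivered message starts a new state
    # interval whose outcome depends only on the prior state and the message.
    return state + [message]

def replay(initial_state, delivery_log):
    """Re-run a process from its initial state using the logged
    determinants (delivery order and message contents)."""
    state = list(initial_state)
    for message in delivery_log:
        state = execute(state, message)
    return state

# Two replays with the same log yield the same final state...
log = ["m0", "m3", "m7"]
assert replay([], log) == replay([], log)

# ...but a different delivery order yields a different execution.
assert replay([], ["m3", "m0", "m7"]) != replay([], log)
```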
3 Impossibility of Deterministic Replaying of Logged Non-deterministic Event in a
Consistent Global Recovery Line
3.1 Previous consistency condition in optimistic log-based rollback recovery protocol
In optimistic log-based rollback recovery protocols, previous studies assumed that the execution of each process follows the piecewise deterministic assumption model. In this model, the execution of each process is assumed to take place as a series of state intervals in which events take place; an event may be the execution of an instruction, the sending of a message, and so on. Each state interval in the piecewise deterministic assumption model is assumed to start with a non-deterministic event, such as the receipt of a message; from that moment on, the execution of the process is completely deterministic. A state interval ends with the last event before the next non-deterministic event occurs.

In effect, a state interval can be replayed with a known result, that is, in a completely deterministic way, provided it is replayed starting from the same non-deterministic event as before. Consequently, if a rollback recovery protocol records all non-deterministic events in such a model, it becomes possible to replay the entire execution of a process in a completely deterministic way.
In optimistic log-based rollback recovery protocols, the main issue is keeping the system state consistent after the occurrence of a failure [4, 6, 10]. For the system state to be consistent, no message may be an orphan message and no process may be an orphan process at recovery time. When the delivery event of a message is recorded at the receiver process but the sending event of the message is not recorded at the sender process, the message becomes an orphan message and its receiver process becomes an orphan process at recovery time. In log-based rollback recovery protocols, an orphan message and an orphan process are defined as follows.
Definition 1 (Orphan Message) If $\exists m$ such that $DE(m) \in Log(s^k_{p_i}(m))$ and $SE(m) \in UnLog(s^l_{p_j}(m'))$, where $p_i$ is the receiver process and $p_j$ is the sender process of message $m$, and $UnLog(s^l_{p_j}(m'))$ is lost due to a failure, then message $m$ becomes an orphan message, $\psi(m)$, at recovery time.
Definition 2 (Orphan Process) If the following two conditions are satisfied at recovery time, then process $p_i$ becomes an orphan process, $\psi(p)$.

• (Before failure): $\exists m$ such that $DE(m) \in Log(s^k_{p_i}(m))$ and $SE(m) \in UnLog(s^l_{p_j}(m'))$, where $p_i$ is the receiver process and $p_j$ is the sender process of message $m$.

• (After failure): $UnLog(s^l_{p_j}(m'))$, saved in volatile memory at process $p_j$, is lost due to the failure.
In Definition 1 and Definition 2, $s^k_{p_i}(m)$ denotes the $k$-th state interval of process $p_i$, started by the delivery event of message $m$. $DE(m)$ denotes the delivery event of message $m$, and $SE(m)$ denotes the sending event of message $m$. $Log(s^k_{p_i}(m))$ means that the determinant (message content and delivery order) of message $m$ is logged on stable storage at process $p_i$. $s^l_{p_j}(m')$ denotes the $l$-th state interval of process $p_j$, started by message $m'$. $UnLog(s^l_{p_j}(m'))$ means that the determinant of message $m'$ is not logged on stable storage at process $p_j$ and may be lost due to a failure. In Figure 1, message $m_6$ is not an orphan message, because the determinant of message $m_3$ is logged on stable storage at process $p_1$. However, message $m_8$ may become an orphan message, because the determinant of message $m_5$ is not logged on stable storage at process $p_1$.
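Definitions 1 and 2 can be phrased as a simple predicate. The sketch below uses an illustrative data layout (plain sets of logged and lost sending/delivery events); it is not part of the protocol itself.

```python
# Sketch of the orphan-message test from Definitions 1 and 2: message m is an
# orphan if its delivery is logged at the receiver, while the sender's state
# interval that sent m was never logged to stable storage and was lost in the
# failure. The set-based layout is an illustrative assumption.

def is_orphan_message(m, receiver_logged, sender_logged, sender_lost):
    """receiver_logged: messages whose delivery event the receiver logged on
    stable storage; sender_logged: messages whose sending interval the sender
    logged; sender_lost: messages whose unlogged sending interval was lost."""
    return (m in receiver_logged
            and m not in sender_logged
            and m in sender_lost)

# Figure 1 example (hypothetical encoding): m6 is not an orphan because its
# sending interval is logged at p1; m8 may be, because its sending interval
# was only in volatile memory at p1 and was lost.
receiver_logged = {"m6", "m8"}
p1_logged_sends = {"m6"}
p1_lost_sends = {"m8"}
assert not is_orphan_message("m6", receiver_logged, p1_logged_sends, p1_lost_sends)
assert is_orphan_message("m8", receiver_logged, p1_logged_sends, p1_lost_sends)
```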
In log-based rollback recovery protocols, when a failure occurs at any process, the set of state intervals of the processes is called a recovery line. The definition of a recovery line is as follows.
Definition 3 (Recovery Line) When a failure occurs at any process, the set of state intervals of the processes $\bigcup^{\aleph}_{i=0} s^k_{p_i}$ is a recovery line, where $k$ is a state interval number of each process and $\aleph$ is the number of processes in the system.
In addition, a recovery line containing an orphan message and an orphan process is called an inconsistent recovery line, and a recovery line containing no orphan message and no orphan process is called a consistent global recovery line [3, 15]. The following two definitions define an inconsistent recovery line and a consistent global recovery line, respectively.
Definition 4 (Inconsistent Recovery Line) When a failure occurs at any process, if $\exists m$ such that $DE(m) \in Log(s^k_{p_i}(m))$ and $SE(m) \in UnLog(s^l_{p_j}(m'))$, where $p_i$ is the receiver process and $p_j$ is the sender process of message $m$, $s^k_{p_i}(m)$ and $s^l_{p_j}(m') \in \bigcup^{\aleph}_{i=0} s^k_{p_i}$, and $UnLog(s^l_{p_j}(m'))$, saved in volatile memory at process $p_j$, is lost due to the failure, then $\bigcup^{\aleph}_{i=0} s^k_{p_i}$ is an inconsistent recovery line.
Definition 5 (Consistent Global Recovery Line) When a failure occurs at any process, if $\nexists m, \forall i, j$ such that $DE(m) \in Log(s^k_{p_i}(m))$ and $SE(m) \in UnLog(s^l_{p_j}(m'))$, where $p_i$ is the receiver process and $p_j$ is the sender process of message $m$, $s^k_{p_i}(m)$ and $s^l_{p_j}(m') \in \bigcup^{\aleph}_{i=0} s^k_{p_i}$, and $UnLog(s^l_{p_j}(m'))$, saved in volatile memory at process $p_j$, is lost due to the failure, then $\bigcup^{\aleph}_{i=0} s^k_{p_i}$ is a consistent global recovery line.
In Figure 2, when a failure occurs at state interval $s^3_{p_1}$ and $UnLog(s^3_{p_1}(m_7))$ is lost due to the failure, the set of state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$ is a consistent global recovery line, because it contains no orphan message and no orphan process. However, the set of state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^3_{p_2}$ is an inconsistent recovery line, because message $m_8$ becomes an orphan message and process $p_2$ becomes an orphan process at recovery time.
Figure 2. Two recovery lines in optimistic log-based rollback recovery protocol
Previous optimistic log-based rollback recovery protocols all assumed that the logged non-deterministic events in a consistent global recovery line can be deterministically replayed at recovery time. In Figure 2, if a failure occurs at state interval $s^3_{p_1}$ of process $p_1$ after message $m_6$ is sent to process $p_0$, and the consistent global recovery line consists of state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$, then process $p_1$ should be able to deterministically replay the sending event of message $m_6$ at state interval $s^2_{p_1}$ during recovery. However, we observe that message $m_6$ may not be deterministically re-sent to process $p_0$ at state interval $s^2_{p_1}$ during recovery, because of the asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. In the next subsection, we show the impossibility of deterministically replaying a logged non-deterministic event in a consistent global recovery line.
3.2 Impossibility of deterministic replaying of logged non-deterministic event in a consistent global recovery line
In asynchronous distributed systems, there is no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. The absence of a bound on the relative speeds of processes means that the execution of processes at recovery time may differ from the execution before the failure. The absence of a bound on message transmission delays means that the order in which messages are received at recovery time may differ from the order before the failure. Finally, the absence of a global time source means that concurrent events in different processes cannot be coordinated without exchanging messages. Previous optimistic log-based rollback recovery protocols all assumed that message sending events within the same state interval can be deterministically replayed at recovery time if the state interval is contained in a consistent global recovery line; thus, a message sent in a state interval whose determinant is logged on stable storage and which is contained in a consistent global recovery line cannot be an orphan message. In Figure 2, message $m_6$ is not an orphan message in the consistent global recovery line $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$, because the determinant of state interval $s^2_{p_1}$ is logged on stable storage at process $p_1$, and process $p_0$ is not an orphan process at recovery time.
To show how previous optimistic log-based rollback recovery protocols determine a consistent global recovery line after the occurrence of a failure, assume in Figure 2 that a failure occurs in the fourth state interval $s^3_{p_1}$ of process $p_1$, and that the determinants of state intervals $s^1_{p_1}$ and $s^2_{p_1}$ are logged on stable storage, but the determinant of state interval $s^3_{p_1}$ is not logged and is lost due to the failure. At recovery time, the system determines a consistent global recovery line consisting of state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$, and processes $p_0$, $p_1$ and $p_2$ deterministically replay up to state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$, respectively. In previous optimistic log-based rollback protocols, the standard method for determining a consistent global recovery line uses the transitive dependency vector. In this method, each process piggybacks the transitive dependency vector maintained for its current state interval on every message it sends. Before delivering a message, the receiver updates its transitive dependency vector by comparing it to the vector piggybacked on the message. The update procedure for the transitive dependency vector is as follows [1, 2].
$$Update(SN(p_0), \ldots, SN(p_i), \ldots, SN(p_\aleph))_{p_i} = (\max(SN(p_0)_{p_i}, SN(p_0)_m), \ldots, SN(p_i)_{p_i} + 1, \ldots, \max(SN(p_\aleph)_{p_i}, SN(p_\aleph)_m)) \qquad (3)$$
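The update of equation (3) amounts to an element-wise maximum of the local and piggybacked vectors, followed by incrementing the delivering process's own entry. A minimal sketch, with an illustrative data layout (plain lists of state-interval numbers):

```python
# Sketch of the transitive dependency vector update (equation (3)): on
# delivery at process p_i, take the element-wise maximum of the local vector
# and the vector piggybacked on the message, then advance p_i's own entry,
# since the delivery starts a new state interval at p_i.

def update_dependency_vector(local, piggybacked, i):
    """local, piggybacked: lists of state-interval numbers SN(p_0..p_N);
    i: index of the delivering process."""
    merged = [max(a, b) for a, b in zip(local, piggybacked)]
    merged[i] = local[i] + 1   # delivery starts a new state interval at p_i
    return merged

# Hypothetical example: a process p_0 in interval (2, 2, 1) delivers a
# message carrying (0, 2, 1); its vector advances to (3, 2, 1).
assert update_dependency_vector([2, 2, 1], [0, 2, 1], 0) == [3, 2, 1]
```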
When a failure occurs, the system automatically detects it and restarts the failed process. The restarted process determines its last logged state interval number and broadcasts the occurrence of its failure to all processes in the system. Each process that receives this broadcast message determines its last state interval consistent with the last logged state interval of the restarted process. This procedure continues until a consistent global recovery line is determined. To determine whether a recovery line is consistent or inconsistent, previous optimistic log-based rollback recovery protocols construct a dependency matrix. In Figure 2, the recovery line represented by the solid line, consisting of state intervals $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$, is represented by the following dependency matrix.
$$\begin{pmatrix} [3] & 2 & 1 \\ 0 & [2] & 1 \\ 0 & 2 & [2] \end{pmatrix} \qquad (4)$$
A matrix representing a consistent global recovery line must satisfy the property that every diagonal element is greater than or equal to the other elements in its column. In Figure 2, the inconsistent recovery line consisting of state intervals $s^3_{p_0}$, $s^3_{p_1}$ and $s^3_{p_2}$, represented by the dotted line, has the following dependency matrix.
$$\begin{pmatrix} [3] & 2 & 1 \\ 0 & [2] & 1 \\ 0 & 3 & [3] \end{pmatrix} \qquad (5)$$
In this matrix, the (3, 2) element is greater than the (2, 2) element. That is, state interval $s^3_{p_2}$ of process $p_2$ depends on state interval $s^3_{p_1}$, but $s^3_{p_1}$ is lost due to the failure, so process $p_2$ becomes an orphan process.
is lost due to a failure and process p2 comes to be an orphan process.
However, we detect that deterministic message sending event in consistent global recovery line may not be
replayed in recovery time. To present this situation, we assumed that a failure is occurred in a state interval s3p1
in process p1 and the logging information for determinant of message m7 is lost due to a failure in Figure 3. In
recovery time, systems determines a consistent global recovery line as a set of state intervals s3p0
, s2p1
and s2p2
and
these state intervals consists a followed dependency matrix.
$$\begin{pmatrix} [3] & 2 & 1 \\ 0 & [2] & 1 \\ 0 & 2 & [2] \end{pmatrix} \qquad (6)$$
In this dependency matrix, every diagonal element is greater than or equal to the other elements in its column. However, suppose that during recovery process $p_1$ receives message $m_7$ from process $p_2$ after it has re-sent messages $m_4$ and $m_5$ to processes $p_0$ and $p_2$, but before it has re-sent message $m_6$ to process $p_0$. Then process $p_1$ cannot determine whether it may deliver message $m_7$ or not. If message $m_7$ arrives before process $p_1$ has re-sent messages $m_4$ and $m_5$, process $p_1$ first sends $m_4$ and $m_5$, because it knows from the dependency matrix representing the consistent global recovery line (the (1, 2) and (3, 2) elements) that processes $p_0$ and $p_2$ depend on its state interval $s^2_{p_1}$. However, using only the dependency matrix representing the consistent global recovery line, $p_1$ cannot know that both state intervals $s^2_{p_0}$ and $s^3_{p_0}$ depend on its state interval $s^2_{p_1}$. Thus, it may deliver message $m_7$, re-sent by process $p_2$, during recovery; delivering $m_7$ before sending message $m_6$ to process $p_0$ makes it impossible for process $p_1$ to deterministically re-send message $m_6$ from state interval $s^2_{p_1}$. As a result, message $m_6$ becomes an orphan message and a process becomes an orphan process in the consistent global recovery line $s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$ at recovery time. Although the non-deterministic message delivery event $DE(m_6)$ belongs to a consistent global recovery line, there is no guarantee that it can be deterministically replayed at recovery time. To summarize this situation, we distinguish three cases in which process $p_1$ delivers message $m_7$ during recovery.
• Message $m_7$ arrives at state interval $s^2_{p_1}$ before process $p_1$ sends message $m_4$ to process $p_0$: Process $p_1$ delays the delivery of message $m_7$ until it has sent messages $m_4$ and $m_5$ to processes $p_0$ and $p_2$, respectively, because it knows from the dependency matrix representing the consistent global recovery line ($s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$) that processes $p_0$ and $p_2$ depend on its state interval $s^2_{p_1}$. However, process $p_1$ cannot determine whether it may deliver message $m_7$ before sending message $m_6$ to process $p_0$, because it does not know that both state intervals $s^2_{p_0}$ and $s^3_{p_0}$ depend on its state interval $s^2_{p_1}$.
• Message $m_7$ arrives at state interval $s^2_{p_1}$ after process $p_1$ has sent message $m_4$ to process $p_0$ but before it has sent message $m_5$ to process $p_2$: Process $p_1$ delays the delivery of message $m_7$ until it has sent message $m_5$ to $p_2$, because it knows from the dependency matrix representing the consistent global recovery line ($s^3_{p_0}$, $s^2_{p_1}$ and $s^2_{p_2}$) that process $p_2$ depends on its state interval $s^2_{p_1}$. However, process $p_1$ cannot determine whether it may deliver message $m_7$ before sending message $m_6$ to process $p_0$, because it does not know that both state intervals $s^2_{p_0}$ and $s^3_{p_0}$ depend on its state interval $s^2_{p_1}$.
• Message $m_7$ arrives at state interval $s^2_{p_1}$ after process $p_1$ has sent messages $m_4$ and $m_5$ to processes $p_0$ and $p_2$, respectively: Process $p_1$ cannot determine whether it may deliver message $m_7$ before sending message $m_6$ to process $p_0$, because it does not know that both state intervals $s^2_{p_0}$ and $s^3_{p_0}$ depend on its state interval $s^2_{p_1}$.
In distributed systems, there is no guarantee that the order in which messages are received at recovery time is the same as the order in which they were received during failure-free execution, due to the asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. If the last state interval of some process depends on two state intervals of the same process in a consistent global recovery line, and that process receives a message before it re-sends the second of those messages at recovery time, the process cannot determine
Figure 3. Ω Line Problem in optimistic log-based rollback recovery protocol
whether it may deliver this message or not. If the process delivers this message before sending its second message, then a regular message becomes an orphan message and a regular process becomes an orphan process in a consistent global recovery line. In the next subsection, we define this situation as the Ω Line Problem.
3.3 Ω Line problem in a consistent global recovery line
In the previous subsection, we showed the impossibility of deterministically replaying a logged non-deterministic event in a consistent global recovery line. We define this situation as the Ω Line Problem. The following definition states the Ω Line Problem.
Definition 6 (Ω Line Problem) If process $p_i$ performs the following executions at the occurrence of a failure and at recovery time, then these executions create the Ω Line Problem.

• (Occurrence of a failure): Before the occurrence of the failure, process $p_i$ sends two or more messages, $\sum^{l}_{k=1} m_k$ ($l \geq 2$), to process $p_j$ at a state interval $s^{last}_{p_i}$ that belongs to the consistent global recovery line $\bigcup^{\aleph}_{i=1} s^{l}_{p_i}$ as the last state interval of process $p_i$ after the failure occurs.

• (Recovery time): Process $p_i$ receives a message, $m_\Omega$, in its last state interval $s^{last}_{p_i}$ in the consistent global recovery line $\bigcup^{\aleph}_{i=1} s^{l}_{p_i}$ before it sends the messages $\sum^{l}_{k=1} m_k$ ($l \geq 2$) to process $p_j$, and the last state interval $s_{p_j}(m_l)$ belongs to the consistent global recovery line $\bigcup^{\aleph}_{i=1} s^{l}_{p_i}$.
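The conditions of Definition 6 can be phrased as a predicate. The sketch below is an illustrative simplification, not the paper's detection mechanism: it abstracts the recovery line to the list of messages $p_i$ sent to $p_j$ from its last interval and a count of how many have been re-sent so far.

```python
# Sketch of the Ω Line condition from Definition 6 (data layout is an
# illustrative assumption): the problem arises when a process sent two or
# more messages to the same destination from its last state interval in the
# consistent global recovery line, and during recovery a message arrives in
# that interval before all of those sends have been replayed.

def creates_omega_line_problem(sends_in_last_interval, replayed_sends,
                               message_arrived_during_replay):
    """sends_in_last_interval: messages p_i sent to p_j from s^last before
    the failure; replayed_sends: how many have been re-sent in recovery;
    message_arrived_during_replay: whether some m_Ω arrived meanwhile."""
    return (len(sends_in_last_interval) >= 2
            and replayed_sends < len(sends_in_last_interval)
            and message_arrived_during_replay)

# Figure 3 scenario (hypothetical encoding): p1 sent m4 and m6 to p0 from
# s2_p1, and m7 arrives after m4 is replayed but before m6 is:
assert creates_omega_line_problem(["m4", "m6"], 1, True)
# Once all sends are replayed, delivering the late message is safe:
assert not creates_omega_line_problem(["m4", "m6"], 2, True)
```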
4 New Optimistic Log-based Rollback Recovery Protocol
In this section, we propose a new optimistic log-based rollback recovery protocol that guarantees the deterministic replaying of all logged non-deterministic events in a consistent global recovery line and solves the Ω Line Problem.
4.1 Data structure
The data structures maintained by process $p_i$ for recovery from a failure are as follows.
• $DeV^{k}_{p_i} = (SN(p_0), SN(p_1), \cdots, k, SN(p_{i+1}), \cdots, SN(p_\aleph))$: Process $p_i$ maintains the transitive dependency vector $DeV^{k}_{p_i}$ in order to identify the state intervals of other processes that have a dependency on its $k$-th state interval [13]. Process $p_i$ piggybacks $DeV^{k}_{p_i}$ on every message sent from its $k$-th state interval $s^{k}_{p_i}$. In addition, whenever process $p_i$ delivers a message, it computes the new transitive dependency vector $DeV^{k+1}_{p_i}$ by comparing its $DeV^{k}_{p_i}$ with the transitive dependency vector piggybacked on the message. This update procedure is presented in Figure 4.

• $\varepsilon^{\upsilon}_{p_i}$: Process $p_i$ piggybacks the direct dependency counting number $\varepsilon^{\upsilon}_{p_i}$ on every message sent to other processes within a state interval. $\varepsilon^{\upsilon}_{p_i}$ is the number of messages sent to the same process within one state interval.

• $ICN_{p_i}$: Process $p_i$ maintains the variable $ICN_{p_i}$, which records the number of failure occurrences in the system.
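A possible in-memory layout for these structures is sketched below. Class and field names are illustrative assumptions, and the ε bookkeeping is simplified (every send to a destination is counted, and counters reset when a new state interval starts):

```python
# Sketch of the per-process state from Section 4.1 (names are illustrative):
# the transitive dependency vector DeV, the per-destination direct dependency
# counter ε, and the failure-occurrence counter ICN.

class ProcessState:
    def __init__(self, num_processes, pid):
        self.pid = pid
        self.dev = [0] * num_processes   # DeV^k_{p_i}
        self.epsilon = {}                # ε per destination, per interval
        self.icn = 0                     # number of failures observed

    def on_send(self, dest):
        """Count repeated sends to the same destination within one state
        interval; return the values piggybacked on the outgoing message."""
        self.epsilon[dest] = self.epsilon.get(dest, 0) + 1
        return self.dev[:], self.epsilon[dest]

    def on_deliver(self, piggybacked_dev):
        """Merge dependency vectors and open a new state interval
        (the ε counters reset to 1, as in Figure 4)."""
        self.dev = [max(a, b) for a, b in zip(self.dev, piggybacked_dev)]
        self.dev[self.pid] += 1
        self.epsilon = {dest: 1 for dest in self.epsilon}

p = ProcessState(3, 1)
_, eps = p.on_send(0)
assert eps == 1
_, eps = p.on_send(0)
assert eps == 2            # second message to p0 in the same interval
p.on_deliver([1, 0, 0])
assert p.dev == [1, 1, 0]  # merged, then own entry advanced
assert p.epsilon[0] == 1   # reset when a new interval starts
```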
4.2 Algorithm
The new optimistic log-based rollback recovery protocol proposed in this paper consists of two algorithms: an algorithm for failure-free time and an algorithm for recovery time.
4.2.1 Algorithm in failure-free time
In failure-free time, process $p_i$ piggybacks its transitive dependency vector on every message it sends. Whenever process $p_i$ delivers a message from another process, it updates its transitive dependency vector by comparing it to the vector piggybacked on the received message before delivering it. If state intervals of a process $p_j$ depend on a state interval of process $p_i$, then process $p_i$ piggybacks the variable $\varepsilon^{\upsilon}_{p_i}$ on the messages sent from that state interval. The variable $\varepsilon^{\upsilon}_{p_i}$ increases with the number of messages sent within a state interval; when the state interval advances to the next state interval, $\varepsilon^{\upsilon}_{p_i}$ is reset to 1. The algorithm executed by process $p_i$ in failure-free time is presented in Figure 4.
——————————————————————————————————————————————————
Do failure-free execution of process pi;
  set DeV^0_pi = (0, 0, · · · , 0);
  Whenever process pi receives a message mr from process pj (i ≠ j) at its state interval s^k_pi;
    update DeV^k_pi before delivering message mr;
    DeV^{k+1}_pi = (Max(SN(p0)_pi, SN(p0)_mr), · · · , k + 1, · · · , Max(SN(pℵ)_pi, SN(pℵ)_mr));
    translate s^k_pi into s^{k+1}_pi;
  Whenever process pi sends message mθ to another process pj at state interval s^k_pi;
    piggyback DeV^k_pi on message mθ;
    if ∃ mθ−1 sent from state interval s^k_pi to process pj then;
      ευ_pi(pj)++;
      piggyback ευ_pi(pj) on message mθ;
      if s^k_pi transits to s^{k+1}_pi then;
        ευ_pi(pj) = 1;
      fi;
    fi;
oD;
——————————————————————————————————————————————————
Figure 4. Algorithm in failure-free time in process pi
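The bookkeeping in Figure 4 can be sketched in a few lines of code: the dependency vector is merged by element-wise maximum on delivery, and the ευ counter tracks how many messages went to each peer in the current state interval. The class and names below are illustrative, not part of the protocol specification.

```python
# Sketch of the failure-free bookkeeping from Figure 4 (names illustrative).

class Process:
    def __init__(self, pid, n_procs):
        self.pid = pid
        # Transitive dependency vector DeV: one sequence number per process.
        self.dev = [0] * n_procs
        # Direct dependency counters (the epsilon-upsilon counter): messages
        # sent to each peer in the current state interval.
        self.sent_in_interval = {}

    def on_deliver(self, piggybacked_dev):
        """Element-wise max with the piggybacked vector, then advance this
        process's own state interval (DeV[pid] plays the role of k)."""
        self.dev = [max(a, b) for a, b in zip(self.dev, piggybacked_dev)]
        self.dev[self.pid] += 1            # s^k -> s^{k+1}
        self.sent_in_interval.clear()      # counters reset in the new interval

    def on_send(self, dest):
        """Return the data piggybacked on a message to `dest`."""
        count = self.sent_in_interval.get(dest, 0) + 1
        self.sent_in_interval[dest] = count
        return list(self.dev), count       # (DeV^k_pi, epsilon-upsilon)

p0, p1 = Process(0, 2), Process(1, 2)
dev, eps = p0.on_send(dest=1)      # first message in the interval: eps == 1
dev, eps = p0.on_send(dest=1)      # second message, same interval: eps == 2
p1.on_deliver(dev)                 # p1 merges p0's vector and advances
```

Note that the counter reset on interval advance corresponds to the `ευ_pi(pj) = 1` step of Figure 4.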
4.2.2 Algorithm in recovery time
If a failure occurs in any process, the system automatically detects it and starts the recovery algorithm on all processes. First, the process restarted by the system after the failure determines the index number, ζ, of the state interval last logged on stable storage by the failed process. The failed process broadcasts a failure-occurrence message containing this index number ζ to all processes in the system. When the other processes receive this broadcast message, mζ, they determine their maximum transitive dependency vectors consistent with state interval s^ζ_pfailed and piggyback these vectors on the reply messages. After the failed process receives, from all processes in the system, the reply messages carrying each sender's maximum transitive dependency vector consistent with s^ζ_pfailed, it builds a dependency matrix and determines whether the set of state intervals composing the matrix is consistent. If the dependency matrix is not consistent and does not form a consistent global recovery line, the failed process requests previous transitive dependency vectors from any process that would become an orphan in the current recovery line, until it determines a consistent global recovery line. After the restarted process determines a consistent global recovery line, it broadcasts a replay message containing the last state interval of each process in the consistent global recovery line. While each process replays events up to the consistent global recovery line, it piggybacks its incremented ICN number on every message it sends. During this replaying execution, each process compares its ICN number with the ICN number piggybacked on each received message. If the piggybacked ICN number is lower than its own ICN number, the process discards the message; otherwise, it delivers messages according to the consistent global recovery line. The algorithm of process pi in recovery time is presented in Figure 5.
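The consistency test on the dependency matrix can be sketched as follows. Row i is the transitive dependency vector reported by process pi, and the check below mirrors the condition as Figure 5 states it (a_ii ≥ a_ik for every row i and column k); the function name is ours.

```python
# Sketch of the recovery-line consistency test on MTX(dep): the candidate
# recovery line is accepted only when each process's own entry (the diagonal)
# dominates every other entry in its row.

def is_consistent_recovery_line(mtx_dep):
    return all(
        row[i] >= row[k]
        for i, row in enumerate(mtx_dep)
        for k in range(len(row))
    )

# Two processes: each one's own entry dominates its row -> consistent.
is_consistent_recovery_line([[2, 1], [1, 3]])   # True
# p0 reports a dependency on p1's interval 5 beyond its own interval 1
# -> not a consistent recovery line.
is_consistent_recovery_line([[1, 5], [0, 3]])   # False
```

When the test fails, the protocol's reaction (per the text above) is to request earlier transitive dependency vectors from the offending processes and re-run the test.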
4.3 Proof of Correctness
The correctness of the proposed optimistic log-based rollback recovery protocol is proved in Theorem 1.
Theorem 1 Suppose there exists ω ≥ 0 such that a state interval s^k_pi of process pi belongs to a consistent global recovery line, process pi sends messages mψ through mψ+ω to process pj, and the state interval s^η_pj(mψ+ω) of process pj belongs to the consistent global recovery line. Then the proposed recovery protocol, Υ, guarantees the deterministic replaying of all non-deterministic events in the consistent global recovery line and solves the Ω Line Problem.
Proof To prove Theorem 1, we consider three cases in recovery time and show that the proposed recovery protocol, Υ, always solves the Ω Line Problem.
Case 1: ω = 0 - If ω equals 0, there is no Ω Line Problem in the consistent global recovery line. In this case, the proposed protocol, Υ, is the same as previous optimistic log-based rollback recovery protocols. After a failure occurs, Υ constructs a transitive dependency matrix, MTX(dep), and determines a consistent global recovery line. Each process in the system deterministically replays its non-deterministic events up to the consistent
——————————————————————————————————————————————————
When a failure occurs in process pi;
System restarts the failed process pi;
Do
  determine the last state interval index number, ζ, logged on stable storage in process pi before the failure;
  broadcast failure-message(pi, ζ) to all processes in the system;
  wait for reply-message(pj, DeV^λ_pj, ευ_pi(pj)) from all processes in the system;
  continue;
  if process pi receives reply-messages from all processes then
    make the transitive dependency matrix MTX(dep) from ∑_{k=0}^ℵ DeV_pk, such as

      a0,0(p0)         a0,1(p0)         · · ·   a0,ℵ−1(p0)
      a1,0(p1)         a1,1(p1)         · · ·   a1,ℵ−1(p1)
      ...              ...                      ...
      aℵ−1,0(pℵ−1)     aℵ−1,1(pℵ−1)     · · ·   aℵ−1,ℵ−1(pℵ−1);

    if ∀i (i = 0, 1, · · · , ℵ − 1): ai,i ≥ ai,k in MTX(dep) then
      determine that MTX(dep) is a consistent global recovery line;
    fi;
  else
    send request-message(previous transitive dependency vector) to processes having an
    inconsistent relation with other processes;
until process pi constructs a transitive dependency vector representing a consistent global recovery line;
oD;

In replaying non-deterministic events in a consistent global recovery line;
Do;
  if ∃j: ευ_pi(pj) > 1 in the last state interval belonging to a consistent global recovery line then
    delay process pi from delivering a new message mnew
    until process pi sends the ευ_pi(pj)-th message to process pj;
  fi;
oD;
——————————————————————————————————————————————————
Figure 5. Algorithm in recovery time in process pi
global recovery line. Because there is no Ω Line Problem in the consistent global recovery line, all non-deterministic events in it can be deterministically replayed.
Case 2: ω = 1 - If ω equals 1, there is one additional message, mψ+1. When process pi replays its last state interval in the consistent global recovery line, it knows that it must send two messages, mψ and mψ+1, to process pj before it delivers any messages in recovery time, because ευ_pi(pj) = 2. Thus, Υ guarantees that all non-deterministic events in the consistent global recovery line are deterministically replayed, even though the Ω Line Problem is present.
Case 3: ω = k (k ≥ 2) - If ω equals k ≥ 2, there are k + 1 messages, from mψ to mψ+k. When process pi replays its last state interval in the consistent global recovery line, it knows that it must send the k + 1 messages from mψ to mψ+k to process pj before it delivers any messages in recovery time, because ευ_pi(pj) = k + 1. Thus, Υ guarantees that all non-deterministic events in the consistent global recovery line are deterministically replayed, even though the Ω Line Problem is present.
Consequently, Υ always guarantees that all non-deterministic events in a consistent global recovery line are deterministically replayed, and it solves the Ω Line Problem in recovery time. ∎
5 Conclusion
In this paper, we show that some non-deterministic events in a consistent global recovery line cannot be deterministically replayed in recovery time, a phenomenon we call the Ω Line Problem. In addition, we propose a new optimistic log-based rollback recovery protocol and prove that it always guarantees that all non-deterministic events in a consistent global recovery line are deterministically replayed in recovery time, thereby solving the Ω Line Problem.
In future work, we will show that the Ω Line Problem also occurs in pessimistic log-based rollback recovery protocols, and we will implement the proposed optimistic log-based rollback recovery protocol in a widely distributed system, the KOREA@Home Project.
References
[1] E. N. Elnozahy, D. B. Johnson and Y. M. Wang, “A Survey of Rollback-Recovery Protocols in Message Passing
Systems,” CMU Technical Report CMU-CS-99-148, June 1999.
[2] MaengSoon Baik, JinGon Shon, Kibom Kim, JinHo Ahn, ChongSun Hwang, “An Efficient Coordinated Checkpointing
Scheme Based on PWD Model,” Lecture Notes in Computer Science, vol. 2344, pp. 780-790, 2002.
[3] E. N. Elnozahy and W. Zwaenepoel, ”On the use and implementation of message logging,” In Proceedings of the 24-th
International Symposium on Fault-Tolerant Computing, pp. 298-307, Jun. 1994.
[4] R. E. Strom and S. Yemini, “Optimistic recovery in distributed systems,” ACM Trans. on Computer Systems, vol. 3,
No. 3, pp. 204-226, Aug. 1985.
[5] L. Alvisi, “Understanding the message logging paradigm for masking process crashes,” Ph.D. Thesis, Department of
Computer Science, Cornell University, Jan. 1996.
[6] L. Alvisi and K. Marzullo, ”Message Logging: Pessimistic, Optimistic, Causal and Optimal,” IEEE Trans. on Software
Engineering, vol. 24, pp. 149-159, Feb. 1998.
[7] S. Rao, L. Alvisi and H. M. Vin, ”The cost of recovery in message logging protocols,” In Proceedings of the 17-th IEEE
Symposium on Reliable Distributed Systems(SRDS), pp. 10-18, 1998.
[8] D. B. Johnson and W. Zwaenepoel, ”Recovery in distributed systems using optimistic message logging and checkpoint-
ing,” Journal of Algorithms, vol. 11, pp. 462-491, Sept. 1990.
[9] E. N. Elnozahy, D. B. Johnson and W. Zwaenepoel, ”The Performance of Consistent Checkpointing,” Proc. 11th Symp.
on Reliable Distributed Systems, pp. 39-47, Oct. 1992.
[10] D. B. Johnson and W. Zwaenepoel, “Sender-based message logging,” In Proceedings of the 17th International Symposium
on Fault-Tolerant Computing, pp. 14-19, June 1987.
[11] D. B. Johnson, ”Distributed system fault tolerance using message logging and checkpointing,” Ph. D. Thesis, Rice
University, Dec. 1989.
[12] K. M. Chandy and L. Lamport, ”Distributed snapshots: Determining global states of distributed systems,” ACM
Trans. on Computer Systems, vol. 3, No. 1, pp. 63-75, Feb. 1985.
[13] R. Koo and S. Toueg, ”Checkpointing and rollback-recovery for distributed systems,” IEEE Trans. on Software Engi-
neering, vol. SE-13, No. 1, pp. 23-31, Jan. 1987.
[14] G. Cao and M. Singhal, ”On Coordinated Checkpointing in Distributed Systems,” IEEE Trans. on Parallel and Dis-
tributed Systems, vol. 9, No. 12, pp. 1213-1225, Dec. 1998.
[15] H. J. Michel, R. H. Netzer and M. Raynal, ”Consistency Issues in Distributed Checkpoints,” IEEE Trans. on Software
Engineering, vol. 25, No. 2, pp. 274-281, March/April. 1999.
[16] Nitin H. Vaidya, ”Staggered Consistent Checkpointing,” IEEE Trans. on Parallel and Distributed Systems, vol. 10, No.
7, pp. 694-702, July. 1999.
[17] J. Fowler and W. Zwaenepoel, ”Causal Distributed Breakpoints,” Proc. Int’l Conf. Distributed Computing Systems,
pp. 134-141, May. 1990.
[18] Y. M. Wang, Y. Huang and W. K. Fuchs, ”Progressive retry for software failure recovery in message passing applica-
tions,” IEEE Trans. on Computers, Vol. 46(10), pp 1137-1141, Oct. 1997.
A Lightweight P2P Extension to ICP
Ping-Jer Yeh1, Yu-Chen Chuang2, and Shyan-Ming Yuan1
1 Department of Computer and Information Science, National Chiao Tung University,
Hsinchu, 300 Taiwan pjyeh, [email protected]
2 FAB2 FA Department, Macronix International Co., Ltd,
Hsinchu 300, Taiwan [email protected]
Abstract. Traditional Web cache servers based on HTTP and ICP infrastructure
tend to have higher hardware and management cost, have difficulty in failover
service, lack automatic and dynamic reconfiguration, and may have slow links
to some users. We find that the peer-to-peer architecture can help solve these
problems. The peer cache service (PCS) leverages each peer’s local cache,
similar access patterns, fully distributed coordination, and fast communication
channels to enhance response time, scale of cacheable objects, and availability.
Incorporating strategies such as making the protocol lightweight and backward
compatible with existing cache infrastructure, undertaking dynamic three-level
caching, exchanging cache information, and supporting mobile devices further
improves the effectiveness.
Keywords: peer-to-peer, ICP, cache, mobile devices, distributed.
1 Introduction
There are several problems with traditional Web cache service based on HTTP
(Hypertext Transfer Protocol) and ICP (Internet Cache Protocol) infrastructure. This
section identifies what we are trying to solve, and then outlines the goals we are trying
to achieve.
1.1 Problems of Traditional Web Cache Service
Traditional Web cache service follows a client-server architecture. The association
between a Web client and a number of potential cache servers can be mandatory (i.e.,
a single pre-defined proxy server handles all URL requests) or configurable by proxy
auto-config mechanism [1] (i.e., different proxies handle different regions of URL
requests). On the other hand, the group of cache servers can be organized in two
cooperative ways: parent-child hierarchy or sibling mesh [2], and they communicate
by inter-cache protocols such as ICP [3][4].
There are several problems with this architecture, however. First, to improve
locality, an organization tends to set up a child cache server locally under the
assumption that users within the same organization have more chance to generate
recurrent usage patterns. But a problem then arises: cache servers tend to be the
bottleneck in both CPU load and storage space, bringing in higher hardware and
management cost for small offices or organizations.
The second problem is that cache servers tend to be the single point of failure. The
issue of failover can be investigated from two angles: the server part and the client
part. From the server’s standpoint, cluster technology could help, but it requires
additional system-level support and management skills. From the client’s standpoint,
a more dynamic proxy selection scheme could help. With the current proxy auto-config
mechanism, however, the list of proxy candidates that a Web client can consult is
fixed in advance unless the setting is explicitly re-configured and reloaded again and
again. Therefore, it cannot provide failover gracefully.
The lack of a dynamic proxy selection scheme also raises another problem in a
mobile environment. When a mobile device (such as notebooks, PDAs, and cell
phones) roams to a different place, it has to re-configure its proxy setting accordingly.
Such a priori knowledge may be unknown, however, since neither HTTP nor ICP can
dynamically query for proxies, like DHCP does for DNS servers.
The last but not least problem is that the connection to accessible cache servers
may be too slow. A prevailing approach is to set up a new proxy within the
organization, but this reintroduces the first two problems mentioned above.
1.2 Inspiration and Design Goals
Peer-to-peer (P2P) technology has potential for solving these problems. The basic
idea is to leverage local cache in each Web client to form a common cache space
suitable for small offices or organization scale. In this paradigm, each Web client acts
as a peer that shares its local storage with other peers, forming a larger cache space. In
this way, hardware cost is reduced since both CPU load and storage space are fully
distributed among peers. Management cost for computers and mobile devices is also
reduced since no a priori knowledge about topology of cache service is required.
Availability can be improved since the responsibility for cache service is also fully
distributed. Connection of cache service can also be improved if most peers are in the
same subnet.
P2P introduces new challenges, however. First, no central coordination server
exists in a fully distributed P2P environment. Everything is spontaneous, and
collaboration among peers is harder. If done poorly, traffic on the subnet may increase
and performance may decrease. Second, the effectiveness of cache service relies
heavily on cache information stored in peers. The total number of cacheable objects
may drop if peers become too homogeneous or isolated. Third, the topology and
boundary of peers may be unpredictable, and so is the query time for cacheable objects.
If the range of peers becomes too large, the overall performance may be worse than
that of traditional servers. Fourth, there has been much investment in traditional cache systems.
How can the new P2P cache system coexist or even cooperate with traditional ones?
Existing P2P cache projects, with their various focuses and goals, have not addressed all
these issues. To name a few, Squirrel [5] focuses on locating and routing, and Tuxedo [6]
focuses on transcoding and edge services, but both are incompatible with existing ICP
infrastructure. BuddyWeb [7] requires a central server (called the LIGLO server) and an
agent-based infrastructure for registration and tracking. Other systems, such as
FlashSurfer [8], seem to ignore these issues altogether.
Therefore, additional requirements have to be fulfilled in order to adopt the P2P
approach to Web cache service. First, the protocol should be lightweight in both
packet size and communication frequency. Second, cache information in each peer
should be exchanged effectively so as to increase the number of cacheable objects and
hit ratio. Third, the peers’ boundary should be limited to a reasonable size. Fourth, the
architecture should be as compatible as possible with current Web infrastructure built
upon HTTP and ICP, so as to leverage existing resource.
The remaining paper is organized as follows. Section 2 outlines global design
strategies for caching. Section 3 discusses the system architecture and protocol.
Section 4 discusses implementation issues, and Section 5 gives a conclusion.
2 PCS Design Strategies
This section gives an overview of environment settings of the peer cache service
(PCS), and then outlines global strategic decisions for caching, topology, and protocol
design.
2.1 Roles and Environment Setting
Our PCS is designed for small offices or organization scale. In such an environment,
each computer acts as a PCS peer, sharing its own local storage with others to form a
common cache space. The roles of peers can be classified according to their relative
location. Take Fig. 1 for example: if we are using computer A, then
• The local peer (LP) is A.
• Neighbor peers (NPs) are peers within the same subnet, i.e., B and C.
Communication between the LP and NPs is therefore fast since they reside in the
same broadcast boundary.
• Friend peers (FPs) are peers outside of the subnet.
It should be noticed that the division of roles is not mandatory but relative. If the focal
peer under consideration is B, for example, then B is the LP, and A and C are both B’s
NPs. Note also that traditional ICP-based cache servers can co-exist in the PCS
environment, too.
In each peer there runs a PCS server, which is responsible for PCS management
and communication issues. A PCS server can serve multiple browsers and users in the
same computer.
2.2 Lightweight Protocol
In contrast to traditional Web cache service, the new PCS environment needs a protocol
to query cacheable objects and to exchange cache information among peers.
The protocol should be lightweight so as to reduce traffic on the subnet. It should also
be as compatible as possible with traditional cache protocols so as to leverage existing
cache infrastructure.
It will be useful to make a distinction between two kinds of communication: the
communication between Web browsers and the PCS server in the same computer, and
the communication among different peers in different computers. The former should
remain HTTP so as to eliminate the need for modifying existing Web browsers. As for
the latter, we design the ICPoP (ICP over P2P) as a backward compatible extension to
ICP for several reasons. The first reason is that ICP is the most popular inter-cache
protocol. Maintaining backward compatibility allows us to leverage a large number of
resources. The second reason is that it is relatively lightweight compared with other
inter-cache protocols. An ICP message has a 20-byte header and a variable-sized payload.
The payload is usually a URL, whose average length is 55 bytes. It is therefore more
lightweight than other protocols (e.g., the length of an HTCP message excluding the
payload is 540 bytes [9]). In addition, ICP messages are sent over UDP, requiring less
resource.
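The size argument above can be made concrete by packing an ICP_QUERY the way RFC 2186 lays out the fields: a 20-byte fixed header, then a 4-byte requester host address and the null-terminated URL as payload. This is an illustrative sketch; the example URL is arbitrary.

```python
import struct

ICP_OP_QUERY = 1
ICP_VERSION = 2

def pack_icp_query(request_number, url, requester_addr=0, sender_addr=0):
    """Pack an ICP_QUERY datagram per RFC 2186: a 20-byte header
    (opcode, version, message length, request number, options, option
    data, sender host address), then the payload."""
    payload = struct.pack("!I", requester_addr) + url.encode("ascii") + b"\0"
    length = 20 + len(payload)                 # total length, header included
    header = struct.pack(
        "!BBHIII",
        ICP_OP_QUERY, ICP_VERSION, length,
        request_number,
        0,                                     # options
        0,                                     # option data
    ) + struct.pack("!I", sender_addr)         # sender host address
    return header + payload

msg = pack_icp_query(42, "http://www.nctu.edu.tw/")
# The whole datagram stays small: 20-byte header + 4 + len(URL) + 1 bytes.
```

With a 55-byte average URL, a query is roughly 80 bytes on the wire, which is why ICP over UDP stays cheap compared with, say, HTCP's 540-byte headers.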
2.3 Support for Mobile Devices
Fig. 1. A typical PCS environment (the LP and its NPs share a subnet; FPs and a traditional ICP server sit in other subnets)

The support for mobile devices in PCS is twofold. First, mobile devices should be
able to operate free from configuration, since everything is spontaneously configured
in the PCS environment. It is this plug-and-play operation that facilitates roaming.
Second, the communication should be infrequent so as to reduce power consumption.
As a consequence, we do not adopt complex routing strategies as Squirrel [5] and
Tuxedo [6] have done. To eliminate the need to discover peers and cacheable objects
by trial and error, ICPoP provides a broadcast mode so that a mobile device needs
only issue a single query, and then some friendly NPs will stand out spontaneously.
2.4 Dynamic Three-Level Caching
To limit the boundary of peers in a fully distributed P2P environment, we adopt a
three-level caching scheme as shown in Fig. 2. The level-0 peer is the LP itself. Level-
1 peers are all NPs within the same subnet. Level-2 peers are FPs outside the subnet
boundary. We do not consider more levels because it is usually more efficient to
connect directly to origin servers than to consult level-3 or higher peers.
When the LP wants to query for a cacheable object in the PCS environment, it
broadcasts a level-1 ICPoP query to all NPs. If an NP itself cannot find a cache hit, it
replies with a list of FP candidates that may have a cache hit, so that the LP has a
second chance to perform a level-2 caching. The list of level-2 FPs may be
recommended according to various criteria such as latency and expiration state.
Therefore, the three-level cache tree is dynamic and usually unbalanced; level-0 and
level-1 nodes are usually fixed unless some computers ever start or shutdown in the
middle, while level-2 nodes are usually changing.
2.5 Exchange for Cache Information
The effectiveness of cache service relies heavily on cache information stored in peers.
If cache information is too homogeneous or isolated, the space of cacheable objects is
too restricted. If too heterogeneous, however, locality is too low to benefit from
caching. A balance is therefore required.
Fig. 2. Dynamic three-level peer caching
Peers within the same subnet tend to have recurrent usage patterns, and they can
communicate by broadcasting. Therefore the LP and NPs do not need to exchange
cache information massively and frequently. On the other hand, the LP and FPs do
need to exchange cache information so as to increase diversity and thus the number of
known level-2 peer candidates.
We therefore design a strategy to exchange cache information by peer groups. As
Fig. 3 shows, in a subnet, each peer tries to configure itself with a group ID as
different as possible from other NPs. Then each peer with the same group ID
exchanges cache information periodically by multicasting. In this way we have a
certain degree of diversity and do not bring in too much network traffic.
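The "as different as possible" grouping rule of Fig. 3 can be sketched as a least-used choice over a small ID space. The cap of 8 group IDs reflects the setting reported later in Section 4; the function name is ours.

```python
from collections import Counter

def choose_group_id(neighbor_ids, max_groups=8):
    """Pick a group ID for a joining peer: an unused ID if one exists,
    otherwise the least-used ID among the neighbors' replies (ties broken
    by the lowest ID)."""
    counts = Counter(neighbor_ids)
    return min(range(max_groups), key=lambda gid: (counts[gid], gid))

choose_group_id([0, 0, 1])        # -> 2, the first unused ID
choose_group_id(list(range(8)))   # every ID taken once -> least-used, lowest wins: 0
```

Peers with the same resulting group ID then exchange resource-table entries by multicast, which bounds exchange traffic while still spreading cache information across subnets.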
3 Architecture and Protocol
The overall architecture of our PCS implementation is shown in Fig. 4. The rectangles
inside the box of PCS server are high-level components. The letters (a)-(h) are control
flows between components. The Roman numerals (i)-(x) are network messages.
Broken lines represent HTTP messages, while thick lines represent ICPoP messages.
This section discusses each component and protocol in detail. To save space, letters
(a)-(h) and Roman numerals (i)-(x) in this section all refer implicitly to those in Fig. 4
unless mentioned explicitly.
3.1 HTTP Manager
The HTTP manager is responsible for handling HTTP requests and replies related to
browsers in LP and PCS servers in NPs or FPs. Mobile devices are also welcome. On
receiving an HTTP request (i), it performs the following steps in sequence:
1. It asks the cache manager via (a) to see whether the requested object is in the local
cache.
Fig. 3. Grouping strategy in the PCS environment
2. Otherwise, it asks the P2P manager via (b) to consult NPs and FPs to see who has
the requested object.
3. Otherwise, it connects directly to origin servers or traditional cache servers by
HTTP.
Finally, it asks the cache manager via (a) to cache the object and then replies with the
object (ii). If the HTTP manager finds via (b) that some NPs or FPs may have the
requested object, it sends request messages (iii) to them and waits for their replies (iv).
3.2 ICPoP Manager
The ICPoP manager is responsible for handling ICPoP requests and replies related to
NPs or FPs. Three kinds of ICPoP messages are involved here: query for cacheable
objects, exchange for cache information, and query for mobility support.
Fig. 4. Architecture of a PCS server
Query for Cacheable Objects. On receiving an ICPoP request (v), the ICPoP
manager performs the following steps in sequence:
1. It asks the cache manager via (d) to see whether the requested object is in the local
cache.
2. Otherwise, it asks the resource manager via (f) to recommend a list of FP
candidates that may have the requested object.
Finally, it replies with a message (vi) containing the cache hints.
If the ICPoP manager is asked by the P2P manager via (e) to determine whether level-1
NPs have the requested object, it broadcasts request messages (vii) to the level-1 NPs
and waits for their replies (viii). Finally, it returns to the P2P manager via (e) a list of
NPs and level-2 FP candidates.
If the ICPoP manager is asked by the P2P manager via (e) to determine whether a list
of given level-2 FP candidates have the requested object, it unicasts request messages
(vii) to them and waits for their replies (viii). Finally, it returns to the P2P manager via
(e) a list of FPs having cache hits.
Exchange for Cache Information. The ICPoP manager periodically exchanges cache
information with peers having the same group ID. Initially, it broadcasts a message (vii)
to all NPs in the same subnet asking for their group IDs, and then selects from the
replies (viii) a new or least-duplicated group ID for itself.
If the ICPoP manager is about to share its cache information, it multicasts a
message (vii) to peers with the same group ID. Then it forwards the replies (viii) to
the resource manager via (f).
On receiving an ICPoP request for sharing (v), the ICPoP manager asks both the
cache manager via (d) and the resource manager via (f) to recommend a number of
cache information entries. Finally, it replies with a message (vi) containing that information.
Query for Mobility Support. If the LP is a mobile device and wants to find someone
kindly enough to provide a cache service when roaming to a new place, the ICPoP
manager broadcasts a query message (vii) to all NPs. Hopefully there are some nice
peers showing their generosity in the reply messages (viii). Then the LP can send
HTTP messages (i) to one of these nice peers.
On receiving an ICPoP request for mobility support (v), the ICPoP manager may
reply with a positive or negative answer (vi).
3.3 P2P Manager
The roles of the P2P manager are twofold. It tries to fulfill the HTTP manager’s
asking via (b) for remotely cached objects, and then asks the ICPoP manager via (e) to
perform three-level caching.
On receiving the HTTP manager’s request, the P2P manager first tries to ask the
ICPoP manager to perform level-1 caching. If no cache hit is found, it asks further to
perform level-2 caching. In this way the P2P manager maintains a dynamic (usually
unbalanced) tree of NPs and FPs.
3.4 Cache Manager
The roles of the cache manager are twofold. It tries to fulfill the HTTP manager’s
asking via (a) for retrieving or storing locally cached objects, and to fulfill the ICPoP
manager’s asking via (d) for the retrieval of them.
3.5 Resource Manager
The resource manager is responsible for maintaining dynamically exchanged cache
information. It is invoked by the ICPoP manager via (f). The format of the resource
table is listed in Table 1.
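The resource table of Table 1 amounts to a bounded map from URL to (peer address, timestamp). A minimal sketch follows; the 10,000-entry cap comes from Section 4, while oldest-first eviction is our assumption, since the paper does not name a replacement policy.

```python
from collections import OrderedDict

class ResourceTable:
    """Sketch of the resource table: URL -> (peer address, time)."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self.entries = OrderedDict()              # url -> (peer_addr, time)

    def learn(self, url, peer_addr, time):
        """Record that `peer_addr` held `url`, evicting old entries if full."""
        self.entries.pop(url, None)               # refresh insertion order
        self.entries[url] = (peer_addr, time)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)      # evict the oldest entry

    def recommend(self, url):
        """FP candidates for a miss (an ICP_MISS payload carries up to 10)."""
        entry = self.entries.get(url)
        return [entry[0]] if entry else []

table = ResourceTable(max_entries=2)
table.learn("http://www.nctu.edu.tw/img/txt.jpg", "140.113.23.111", "20020328 140000")
table.learn("http://www.nctu.edu.tw/img/openhouse.jpg", "140.113.23.204", "20020329 150252")
```

Entries learned via ICP_RT_REPLY exchanges would flow into `learn`, and the ICPoP manager's FP recommendations would come from `recommend`.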
3.6 ICPoP
The ICPoP (ICP over P2P) is a backward compatible extension to ICP. The opcodes
supported are listed in Table 2. Briefly speaking, the extension is twofold. Some
opcodes have payloads carrying additional data. In addition, some opcodes are added
to perform three-level caching and to exchange cache information.
Table 1. An example of entries in the resource table
No Resource Peer’s address Time
1 http://www.nctu.edu.tw/img/txt.jpg 140.113.23.111 20020328 140000
2 http://www.nctu.edu.tw/img/openhouse.jpg 140.113.23.204 20020329 150252
Table 2. Opcodes supported in ICPoP. Note that the names
retain the “ICP_” prefix for compatibility purposes
Type Opcode name Value Description Comment
Query ICP_QUERY 1 Query for cacheable objects Unmodified
Reply ICP_HIT 2 Cache exists Extended
Reply ICP_MISS 3 Cache does not exist Extended
Query ICP_GROUP_QUERY 25 Query for group ID New
Query ICP_RT_QUERY 26 Query for resource table New
Query ICP_RELAY_QUERY 27 Query for mobility support New
Reply ICP_GROUP_REPLY 28 Reply for group ID New
Reply ICP_RT_REPLY 29 Reply for resource table New
Reply ICP_RELAY_REPLY 30 Reply for mobility support New
ICP_QUERY (1): Query for cacheable objects. It is broadcasted to level-1 NPs or
unicasted to level-2 FPs by the ICPoP manager (vii). Possible replies (viii) are:
• ICP_HIT (2): the payload may carry additional cache information for exchange.
• ICP_MISS (3): the payload may carry a list of level-2 FP candidates.
ICP_GROUP_QUERY (25): Query for NP’s group ID. It is broadcasted to NPs in
the same subnet by the ICPoP manager (vii). The reply (viii) is:
• ICP_GROUP_REPLY (28): the payload carries another NP’s group ID.
ICP_RT_QUERY (26): Query for entries in the resource table. It is multicasted to
the peers with the same group ID by the ICPoP manager (vii). The reply (viii) is:
• ICP_RT_REPLY (29): the payload carries a number of cache information entries held by
another peer.
ICP_RELAY_QUERY (27): Query for mobility support. It is broadcasted to the
NPs in the same subnet by the ICPoP manager (vii). The reply (viii) is:
• ICP_RELAY_REPLY (30): the payload carries a positive or negative answer.
4 Implementation Issues and Conclusion
We implemented a pilot PCS server on the basis of the Java HTTP Proxy Server [10],
which is simple enough to experiment with our ideas. All major components in the PCS
server are implemented as worker thread pools.
The performance of PCS is sensitive to various factors. In our early experience, the
following settings perform well in a small office:
• We find that a PCS caching scheme of level 3 or higher is unnecessary, if not
harmful. If there seems to be a need for higher levels, a traditional cache server on a
backbone is usually a better choice.
• The maximum number of group IDs is 8.
• An ICP_HIT message may carry as many as 10 entries of cache information from
the cache manager and the resource manager, respectively.
• An ICP_MISS message may carry as many as 10 entries of level-2 FP candidates.
• An ICP_RT_REPLY message may carry as many as 50 entries of cache
information from the cache manager and the resource manager, respectively.
• The resource table may carry as many as 10,000 entries.
These settings are not necessarily optimal, however. A thorough evaluation is
needed to discover an empirically satisfactory configuration.
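The settings listed above can be collected into a single configuration object; the field names below are hypothetical, while the values are taken from the list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PCSConfig:
    """Illustrative bundle of the PCS tuning parameters from the text."""
    max_cache_levels: int = 2              # level 3 or higher judged unnecessary
    max_group_ids: int = 8
    hit_entries_per_manager: int = 10      # ICP_HIT payload limit per manager
    miss_fp_candidates: int = 10           # ICP_MISS level-2 FP candidate limit
    rt_reply_entries_per_manager: int = 50 # ICP_RT_REPLY payload limit
    resource_table_max_entries: int = 10_000
```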
5 Conclusion
This paper has presented a new approach to Web cache service. The peer cache
service (PCS) addresses problems of the traditional client-server caching
architecture, such as hardware and management cost, mobility, and speed.
Strategies such as keeping the protocol lightweight and backward compatible with
the existing cache infrastructure, undertaking dynamic three-level caching,
exchanging cache information, and supporting mobile devices further improve its
effectiveness.
The peer cache service is designed for users in small offices or organizations.
We are investigating empirical issues in order to enhance its performance and
scalability.
Acknowledgments
This work was partially supported by Ministry of Education grant 89-E-FA04-1-4:
high confidence information systems, and by National Science Council grant NSC91-
2213-E009-044: system software on handheld devices.
References
1. Netscape Communications Corporation, “Navigator Proxy Auto-Config File Format”,
Technical Report. URL: http://wp.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html (1996)
2. S. Dykes and K. Robbins, “A Viability Analysis of Cooperative Proxy Caching”,
Proceedings of the 21st Annual Joint Conference of the IEEE Computer and
Communications Societies (INFOCOM 2001), vol. 3, pp. 1205-1214, Anchorage, Alaska,
USA, April 2001.
3. D. Wessels and K. Claffy, “Internet Cache Protocol (ICP), Version 2”, RFC 2186, Network
Working Group, The Internet Society, September 1997.
4. D. Wessels and K. Claffy, “Application of Internet Cache Protocol (ICP), Version 2”, RFC
2187, Network Working Group, The Internet Society, September 1997.
5. S. Iyer, A. Rowstron, and P. Druschel, “Squirrel: A Decentralized Peer-to-Peer Web
Cache”, Proceedings of the 21st ACM Symposium on Principles of Distributed Computing
(PODC 2002), pp. 213-222, Monterey, California, July 2002.
6. W. Shi, K. Shah, Y. Mao, and V. Chaudhary, “Tuxedo: A Peer-to-Peer Caching System”,
Proceedings of 2003 International Conference on Parallel and Distributed Processing
Techniques and Applications (PDPTA’03), Las Vegas, Nevada, June 2003.
7. X. Wang, W. Ng, B. Ooi, K-L. Tan, and A. Zhou, “BuddyWeb: A P2P-based Collaborative
Web Caching System”, Peer to Peer Computing Workshop (Networking 2002), Pisa, Italy,
May 2002.
8. APA Network Technologies Ltd, FlashSurfer. URL: http://www.aranetwork.com/prod_flashsurfer.html
9. P. Vixie and D. Wessels, “Hyper Text Caching Protocol (HTCP/0.0),” RFC 2756, Network
Working Group, The Internet Society, 2000.
10. S. Hsueh, Java HTTP Proxy Server, Version 1.0. URL: http://www.geocities.com/SiliconValley/Bay/8925/jproxy.html (1998)
Comparative Analysis of the Prototype and the Simulator of a New Load
Balancing Algorithm for Heterogeneous Computing Environments
RODRIGO F. DE MELLO 1, LUIS C. TREVELIN 2, MARIA STELA V. DE PAIVA 1
1 Engineering School of São Carlos, Department of Electrical Engineering, University of São Paulo,
R. Carlos Botelho, São Carlos, Brazil
[email protected], [email protected]
2 Computing Department, Federal University of São Carlos,
Rod. Washington Luís, Km 235, São Carlos, Brazil
Abstract - The availability of low-cost hardware has allowed the development of distributed
systems. The allocation of resources in these systems may be optimized through the use of a load
balancing algorithm. Load balancing algorithms make the occupation of an environment
homogeneous, with the aim of improving final performance. This paper presents and
analyzes a new load balancing algorithm based on a tree-shaped hierarchy of computers.
The analysis is made through the results obtained by a simulator and a prototype of this algorithm.
The simulations show a significant performance gain, lowering the response times
and the number of messages that pass through the communication medium to perform the load
balancing operations. After the good simulation results, a prototype was built, and it
confirmed them.
Keywords - load balancing algorithm, heterogeneous environments, high
performance, simulation, prototype.
I. Introduction
The availability of low-cost microprocessors and the evolution of computer
networks have made the construction of distributed systems feasible. On such
systems, the processes are executed on the network computers, which
communicate with one another to perform the same computing operation. In order
to distribute the processes among these systems, a load balancing algorithm may
be adopted.
Load balancing algorithms try to distribute the process load equally among the
computers that are part of the environment. Krueger and Livny [1] demonstrate
that load balancing methods reduce the average and standard deviation of
process response times. Shorter response times are desirable, as they indicate
higher performance in process execution.
Load balancing algorithms involve four policies: transference, selection,
location and information [2, 3]. The transference policy classifies a computer as
an emitter or receptor of processes based on its availability status. The
selection policy defines the process that should be transferred from the busiest
computer to the idlest one. The location policy defines which computer should
receive or send a transferred process: a serving computer offers processes
when it is overloaded, while a receiving computer requests processes when it is idle.
The information policy defines when and how the information on the computers'
availability is updated in the system.
Several load balancing methods have been proposed [2, 4, 5]. According to
Shivaratri et al. [2], these algorithms can be subdivided into: server-initiated,
receiver-initiated, symmetrically initiated, stable server-initiated and stable
symmetrically initiated. Of these algorithms, the one that shows the highest
performance is the stable symmetrically initiated.
Despite recent work in the load balancing area [11-22], the studies and
algorithms of Krueger and Livny [1], Ferrari and Zhou [4], Theimer and Lantz
[5] and Shivaratri, Krueger and Singhal [2] are still considered the main works
in the area. The recent works do not bring significant contributions to process
performance or to the reduction of the number of messages on the network.
The work presented in this paper, like the recent works in the area, is
based on these main studies of the load balancing area.
Later on, Mello, Paiva and Trevelin, during other studies on clusters and
distributed systems [6, 7, 8, 9], observed and proposed a new method for load
balancing. Through these studies, Mello, Paiva and Trevelin [10] concluded that,
by choosing a load index based on the process behavior and on the capacity of the
computers in the environment, it would be possible to obtain better results for the
server-initiated algorithms. In these studies, the authors show a comparative
study between the Lowest algorithm, modified to use a new load index based on
the process behavior and on the computers' capacity, and the Stable Symmetrically
Initiated algorithm. According to the obtained results, the modified Lowest
algorithm shows the highest performance. These analyses have been based on the
studies of Shivaratri et al. [2], and confirm that an adequate choice of load index
modifies the behavior of the load balancing algorithm.
Vital aspects for the success of a load balancing algorithm are: stability in the
number of messages generated in the environment, in order not to overload it;
support for environments composed of heterogeneous computers and processes;
low-cost updates of the environment load information; low-cost choice of the
ideal computer to execute a process received by the system; stability under high
load; system scalability; and short response times. The study and analysis of each
of the aspects mentioned above generated a new load balancing algorithm named
TLBA (Tree Load Balancing Algorithm).
This paper presents the TLBA, a new load balancing algorithm intended for
environments where the following are desirable: high scalability; support for
heterogeneous processes and computers; a decrease in the number of messages that
pass through the communication medium; stability under high loads; and short
response times. A comparative study between the TLBA and the modified Lowest
algorithm proposed by Mello, Paiva and Trevelin [10] was carried out using
simulators. The TLBA is compared to the modified Lowest algorithm because the
latter presented better results than the Stable Symmetrically Initiated one.
After the performance gains of the TLBA were shown through simulation, a
prototype was implemented. This prototype was submitted to tests and the results
confirmed the values obtained through the simulations.
The paper is divided into the following sections: section II presents the
state of the art; section III presents the Tree Load Balancing Algorithm; section IV
presents the comparative studies based on the simulations; section V presents the
TLBA prototype and compares it to the results obtained in the simulations;
section VI presents the conclusions; and, finally, the references.
II. The State of the Art
Among the main studies on load balancing, we may highlight the ones by Zhou
and Ferrari [4], Theimer and Lantz [5], Shivaratri, Krueger and Singhal [2] and
Mello, Paiva and Trevelin [10]. Other recent works have been proposed [11-22];
however, they do not offer significant performance contributions. These recent
works are also based on the studies of Zhou and Ferrari [4], Theimer and Lantz
[5] and Shivaratri, Krueger and Singhal [2].
Zhou and Ferrari [4] evaluate five server-initiated load balancing algorithms,
i.e. algorithms initiated by the most overloaded computer. These algorithms are:
Disted, Global, Central, Random and Lowest. In Disted, when a computer notices
any alteration in its load, it emits messages to the other computers to inform
them about its current load. In Global, one computer centralizes the load
information of all computers in the environment and sends broadcasts to keep the
others updated about the situation. In Central, as in Global, a central computer
receives all the load information of the system; however, it does not update the
other computers about it. This centralizing computer decides on the resource
allocation in the environment. In Random, no information about the environment
load is handled; a computer is selected at random to receive a process to be
initiated. In Lowest, the load information is sent on demand. When a computer
starts a process, it requests and analyzes the loads of a small set of computers
and submits the process to the idlest one, i.e. the one with the shortest
process queue.
When Zhou and Ferrari analyzed the Disted, Global, Central, Random and Lowest
algorithms, they reached the conclusion that all of them show higher performance
than a system with no load balancing at all. However, the Lowest algorithm showed
the highest performance of all. Because it generates messages on demand to
transfer processes, it imposes a lower overload on the communication medium and
thus allows the incremental growth of the environment. The number of messages
generated in the environment does not depend on the number of computers.
Theimer and Lantz [5] implemented algorithms similar to Central, Disted and
Lowest. Nevertheless, they analyzed such algorithms on systems composed of a
larger number of computers (about 70). For Disted and Lowest, process reception
and emission groups were created. The communication within these groups was made
using a multicast protocol, in order to minimize the message exchange among the
computers. Computers with a load below a given limit participated in the process
reception group, whilst computers with a load above the limit participated in the
process emission group.
Theimer and Lantz recommend decentralized algorithms such as Lowest and
Disted, as they do not create single points of failure as Central does. Central
presents the highest performance for small and medium-sized networks, but it
degrades in environments composed of too many computers.
Through their analyses, Theimer and Lantz concluded that algorithms such as
Lowest work with the probability of a computer being idle [5]. Thus, they assume
system homogeneity, as they use the length of the CPU waiting queue as the
load index. The process behavior is not analyzed; therefore, the actual load of
each computer is not measured.
Shivaratri, Krueger and Singhal [2] analyze the server-initiated,
receiver-initiated, symmetrically initiated, adaptive symmetrically initiated and
stable symmetrically initiated algorithms. In these studies, the length of the
process waiting queue at the CPU was used as the load index. This measure was
chosen because it is simple and, therefore, consumes few resources to obtain.
Shivaratri, Krueger and Singhal concluded that the receiver-initiated
algorithms present higher performance than the server-initiated ones, such as
Lowest. However, the algorithm that showed the highest final performance was the
stable symmetrically initiated one. This algorithm preserves the history of the
load information exchanged in the system and uses this historical information
when deciding to transfer processes.
Mello, Paiva and Trevelin [10] analyze a load index based on the process behavior
as well as on the capacity of the system's computers. The authors propose a
comparison between the Stable Symmetrically Initiated algorithm of Shivaratri
et al. [2] and the modified Lowest algorithm, which uses an analysis of the
process behavior and the computers' capacity as the load index. Through these
studies, it was concluded that, by using this new load index, the Lowest
algorithm provides better response times. These results were reached through the
simulation of homogeneous and heterogeneous environments.
III. The Tree Load Balancing Algorithm: A New Model
The TLBA (Tree Load Balancing Algorithm) creates a virtual interconnecting tree
among the computers of the system. In this tree, each computer at level N sends
its updated load information to the level N-1 computers. The lower the level, the
closer it is to the root; the root is located at level zero.
The selection of the best computer to execute a process received by the system
works as a depth-first search on the interconnecting tree. Thus, a level-N
computer may define which of its branches at level N+1 presents the lowest
occupation, passing on a search message that is propagated through the branches
of the tree until it finds the idlest computer.
The tree may be defined as a non-cyclic connected graph. Let G = (X, U) be a
graph with n > 2 vertices; the following properties are equivalent and
characterize G as a tree:
1) G is connected and non-cyclic;
2) G is non-cyclic and has n-1 edges;
3) G is connected and has n-1 edges;
4) G is non-cyclic and, by adding one edge, exactly one cycle is created;
5) G is connected, but stops being so if any edge is suppressed;
6) every pair of G's vertices is connected by exactly one simple chain.
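Property (3) above gives a cheap programmatic test: a graph on n vertices is a tree if and only if it is connected and has n-1 edges. A minimal sketch using union-find (the function name and the 0..n-1 vertex labeling are our assumptions):

```python
def is_tree(n, edges):
    """Check property (3): G is a tree iff it has n-1 edges and no cycle
    (which, with n-1 edges, also forces connectivity)."""
    if len(edges) != n - 1:
        return False
    parent = list(range(n))
    def find(x):
        # Find the set representative, with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:          # this edge would close a cycle
            return False
        parent[ru] = rv       # union the two components
    return True
```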
After defining the tree, it is necessary to define the levels that make it up.
Let G = (X, R) be an oriented graph without circuits; a partition N of X may be
defined as

N = {N0, N1, ..., Nr}

where the elements N0, N1, ..., Nr, called levels, are ordered by the inclusion
of their inverse images R^-1 as follows:

N0 = {xi | R^-1(xi) = ∅}
N1 = {xi | R^-1(xi) ⊆ N0}
N2 = {xi | R^-1(xi) ⊆ (N0 ∪ N1)}
...
Nr = {xi | R^-1(xi) ⊆ N0 ∪ N1 ∪ ... ∪ N(r-1)}
Therefore, the vertices of a level are only preceded by vertices of the previous
levels, which, evidently, is only possible if the graph has no circuits.
The last level satisfies R^+(Nr) = ∅, since there are no more successors in the
graph.
A) Update of the Information Used to Find the Best Computer in the System
In the TLBA, each computer of the system keeps an updated table with occupation
information. This table includes its own information as well as that of the other
computers that are part of its subtree. The subtree of a computer α, located at
level N, is given by all the computers at level N+1 that succeed α through an
oriented arc. The computer α participates in this subtree and is defined as its
father.
Figure 1 – Subtree of a computer α
The computers at level N+1 send information about their occupation to the
antecedent computer located at level N. The father computer generates a single
number that represents its occupation and that of its subtree. To make this
calculation, the father computer sums up the occupations, relative to the
capacities, of each computer in its subtree, generating a single load value for
the subtree it controls. This load value is propagated up to the root node.
A level N+1 computer only sends its load information when the change exceeds a
certain threshold percentage. Assuming a threshold of 5%, the current load is
compared to the one taken at the latest reading; if the current load is 5% above
or below the previous load, then the load information is sent to the father
computer at level N. This threshold decreases the number of messages related to
information updates in the system. By decreasing the overload on the
communication medium, stability increases [2].
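The 5% threshold rule can be sketched as follows; the function name and the relative-change interpretation of "5% above or below" are our assumptions.

```python
def should_propagate(current_load: float, last_sent_load: float,
                     threshold: float = 0.05) -> bool:
    """Send the load to the father only when the current load has drifted
    more than `threshold` (5% in the text) above or below the last value sent."""
    if last_sent_load == 0:
        return current_load != 0   # any nonzero load is news
    return abs(current_load - last_sent_load) / last_sent_load > threshold
```

Each node would call this on every local reading and transmit upward only on a True result, which is what keeps the message count stable.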
B) Load Index Calculation
The load index used by the TLBA is based on an analysis of the CPU and memory
resources consumed by the processes. The reading of each resource's occupation is
given as:

O(i, r, t) = Nread(i, r, t)

where:
i is the computer's identifier
r is the computing resource (CPU or memory)
t is the reading instant
O is the occupation of the read resource
Nread is the occupation reading function

After the occupation of each resource is read, the load index that represents
the load of the analyzed computer is calculated by the equation:

Ncarga(i, t) = O(i, CPU, t)/CAPcpu + O(i, MEMO, t)/CAPmemo
where:
i is the computer’s identifier
t is the occupation reading instant
CPU and MEMO are the resources
O(i, CPU, t) is given by the sum of the CPU’s occupation by each process of the
system at a certain interval
O(i, MEMO, t) is given by the sum of the memory occupation by each process of
the system at a certain interval
CAPcpu is the computer CPU’s capacity at a certain interval
CAPmemo is the computer memory capacity at a certain interval
This load index considers the CPU and memory occupations of each computer in
the system. The occupation of each resource is divided by its total capacity,
generating an occupation percentage. By summing the CPU and memory occupations,
a single index is generated, which is used by the whole system.
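The Ncarga equation above translates directly into code; the parameter names are ours, the formula is the paper's.

```python
def load_index(cpu_occupation: float, mem_occupation: float,
               cap_cpu: float, cap_mem: float) -> float:
    """Ncarga(i, t) = O(i,CPU,t)/CAPcpu + O(i,MEMO,t)/CAPmemo:
    each resource's occupation is normalized by its capacity, and the two
    resulting percentages are summed into a single index."""
    return cpu_occupation / cap_cpu + mem_occupation / cap_mem
```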
C) Selecting the Best Computer of the System
The selection of the best computer to execute a process received by the system is
made by using the tables contained in each computer of the tree. Each computer
has the updated information related to its subtree. This information is propagated
until it reaches the root computer of the system, located on level zero. A request to
execute the process in the system is always received by the root computer and this
computer starts a deep search through the tree until it finds the best computer of
the environment.
The depth-first search starts at the root computer, which compares its load to
those of the other computers in its subtree. Note that the root computer does not
have the load information of every computer in the system; it has only load
values that summarize the loads of its subtrees. If the root computer had the
load of every computer in the system, it could send the process directly to the
idlest node; however, this technique would be similar to the Central algorithm
[4], generating many messages and overloading the network. If the root computer
has the lowest load, it executes the process. If not, it sends a request to the
best computer of its subtree, which continues the same search until the idlest
computer in the system is found.
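The descent described above can be sketched as a recursive search. This is a simplifying sketch: the class and function names are ours, and the comparison rule (a node's own load versus each child's single subtree summary) is an assumption, since the paper does not spell out how the aggregated value is compared.

```python
class Node:
    def __init__(self, load, children=()):
        self.load = load                  # this computer's own load index
        self.children = list(children)
        # single value summarizing the whole subtree, as kept in the tables
        self.subtree_load = load + sum(c.subtree_load for c in self.children)

def select_idlest(node):
    """Depth-first descent from the root: at each step, either keep the
    current computer or descend into the child whose subtree summary is
    lowest, until the idlest computer is reached."""
    best_child = min(node.children, key=lambda c: c.subtree_load, default=None)
    if best_child is None or node.load <= best_child.subtree_load:
        return node
    return select_idlest(best_child)
```

In the real algorithm each descent step is a message to the chosen child, so the cost is bounded by the tree depth rather than by the number of computers.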
D) Migration and Checkpointing of the Initiated Processes
The TLBA provides support for the checkpointing and migration of processes
initiated in the system. When a computer updates its local load information, it
also analyzes the percentage occupation of the system. If the system load is
above a pre-determined limit, it generates an analysis message for the migration
of processes that will follow. This message is sent to the root computer of the
system, which calculates the average occupation of the whole system.
The average occupation of the whole system is calculated by dividing the total
load of the system by the number of computers. The root computer is capable of
making this calculation as it has the updated load information of the whole
system. The value calculated by the root computer is sent back to the computer
that sent the analysis message for the migration that will follow.
The computer that sent the analysis message for the succeeding migration
compares its load to the average load of the remaining computers in the system.
If its load is higher, it starts the checkpointing and requests the root computer
to search for the idlest computer in the system (section III.C). After the idlest
computer is found, it submits one of its processes to it. If its load is lower
than or similar to those of the remaining computers in the system, it does not
perform the checkpointing and migration of processes.
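The migration decision above reduces to a comparison against the root-computed average; the function name is ours, the rule is the paper's.

```python
def should_migrate(local_load: float, total_system_load: float,
                   n_computers: int) -> bool:
    """A computer checkpoints and migrates a process only when its load
    exceeds the system-wide average, which the root computer obtains by
    dividing the total system load by the number of computers."""
    average = total_system_load / n_computers
    return local_load > average
```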
E) Other considerations
Computers join the tree through decisions made by a Broker. This Broker is a
software component that receives requests from computers interested in joining
the tree and defines where each computer will be placed. The goal of the Broker
is to create a balanced tree according to the computers' capacities. The capacity
of each computer is measured through a benchmark program, which is run before the
computer is placed in the tree.
The computers of a higher or lower level detect the failure of a computer. The
higher levels identify failures when trying to send load information (section
III.A); the lower levels identify failures when selecting the best computer to
run a process (section III.C). Once a computer detects a failure, it sends a
message to the Broker, which rebuilds the tree structure. The Broker can run on
several computers to avoid failures.
If system administrators do not want to use the Broker, they can manually
configure the environment, setting the tree levels and computers per level.
However, manual configurations cannot take advantage of the failure treatment
offered by the Broker.
Another relevant aspect is that the system administrator defines the number of
nodes per tree level. Many nodes per level can increase the number of messages
generated between levels, overloading the network; few nodes per level hamper
the depth-first search for the best computer to run a process (section III.C),
since the tree becomes deeper.
IV. Simulations
The proposed simulations compare the behavior of the TLBA with that of the
modified Lowest algorithm [10]. The modified Lowest showed better performance
than the Stable Symmetrically Initiated algorithm, which was proposed by
Shivaratri et al. [2] as the best solution. The simulations are subdivided into
the following groups:
1) Simulations analyzing a system composed of 7 homogeneous and
heterogeneous computers.
2) Simulations analyzing a system composed of 781 homogeneous and
heterogeneous computers.
3) Simulations analyzing the number of messages that pass through the
communication network in the TLBA and the modified Lowest.
These simulation groups were proposed to permit a complete analysis of the
environment; in this way the TLBA was tested for both large and small systems.
To complete the tests, the simulations were done using homogeneous and
heterogeneous environments, so the analyses cover the main possibilities of an
environment. We used 7 and 781 computers precisely to create symmetrical trees
for the tests: for 7 computers, a tree of 3 levels was created, in which each
node has 2 children; for 781 computers, a tree of 5 levels was created, in which
each node has 5 children.
This section uses tables to present the simulation data, because they present
the data more precisely. A Poisson distribution was used to simulate the arrival
of processes to the system, the same probability distribution function used by
Zhou and Ferrari [4], Theimer and Lantz [5] and Shivaratri, Krueger and Singhal
[2].
The following sections show the simulation groups and their respective analyses.
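The sizes 7 and 781 correspond to complete trees (1+2+4 = 7 with fanout 2, and 1+5+25+125+625 = 781 with fanout 5); a short sketch (the function name is ours) reproduces the level counts:

```python
def tree_depth(n_computers: int, fanout: int) -> int:
    """Number of levels in the smallest complete tree holding n computers:
    the first depth d such that 1 + f + f^2 + ... + f^(d-1) >= n."""
    total, depth = 0, 0
    while total < n_computers:
        total += fanout ** depth
        depth += 1
    return depth
```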
A) Comparing the Lowest and TLBA response times in an environment composed of 7 computers
Table 1 presents the results obtained in the simulations conducted in an
environment composed of 7 homogeneous computers. The modified Lowest considered a
subnet of three computers to define the idlest one, which then receives a process
initiated in the system. The Poisson probability distribution function used to
define the process arrivals has a mean of 100 seconds. The capacity of each
computer in the system is 368 cycles per millisecond, and 320 processes were
submitted to the system.
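The arrival model above (Poisson arrivals with a 100-second mean, 320 processes) can be sketched by drawing exponentially distributed inter-arrival times; the function name and the fixed seed are our assumptions.

```python
import random

def arrival_times(n_processes: int, mean_interarrival_s: float = 100.0,
                  seed: int = 42) -> list[float]:
    """Exponentially distributed inter-arrival times yield a Poisson
    arrival process, as used in the simulations (mean 100 s, 320 processes)."""
    random.seed(seed)
    t, times = 0.0, []
    for _ in range(n_processes):
        t += random.expovariate(1.0 / mean_interarrival_s)
        times.append(t)
    return times
```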
TABLE 1 – Response time variation of the modified Lowest and TLBA algorithms for different process occupations

Occupation per Process (millions of cycles) | Modified Lowest response time, homogeneous environment (ms) | TLBA response time, homogeneous environment (ms)
10 | 612755.33 | 604501.43
100 | 6394020.73 | 6317613.53
1000 | 64152160.20 | 63446532.46
10000 | 640987379.63 | 634735601.75
100000 | 6418958569.10 | 6347626353.23
1000000 | 65020352364.06 | 63476532508.20
10000000 | 643614100672.71 | 634765594915.75
100000000 | 6426120894695.79 | 6347656220029.75
1000000000 | 64770720080292.50 | 63476562469883.10
10000000000 | 642306385839865.00 | 634765624969849.00
100000000000 | 6576426630405830.00 | 6347656249969880.00
1000000000000 | 64104959239102000.00 | 63476562499968600.00
From the data of Table 1, exponential regressions were obtained for each
algorithm in the simulated situation, yielding equations that express the
behavior of the algorithms. The equations of the modified Lowest and TLBA
algorithms are presented in figures 2 and 3, respectively.

y = 63,011 · e^(2.3054x)
Figure 2 – Equation of the modified Lowest algorithm for the simulation in an
environment composed of 7 homogeneous computers

y = 62,358 · e^(2.3046x)
Figure 3 – Equation of the TLBA for the simulation in an environment composed of
7 homogeneous computers

It should be mentioned that x in the exponential equations reflects the unitary
number of cycles and y reflects the response time. Thus, in order to analyze an
application load of a million cycles, x = 1,000,000 should be used.
By analyzing the exponential equations obtained through the simulations of the
TLBA and the modified Lowest, it may be concluded that the TLBA equation
presents the shorter response time. This may be noted through the analysis of the
equations' thetas. The TLBA theta1 is 62,358 whereas the modified Lowest one is
63,011; the lower the theta, the shorter the generated response time. Besides
that, theta2 is 2.3046 for the TLBA and 2.3054 for the Lowest. Therefore, the
TLBA generates a function whose response times grow more slowly.
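The regressions behind figures 2 and 3 can be reproduced by ordinary least squares on ln(y), which linearizes y = theta1 · e^(theta2 · x); a self-contained sketch (function name ours, data pairs supplied by the caller):

```python
import math

def fit_exponential(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Fit y = theta1 * exp(theta2 * x) by linear least squares on ln(y)."""
    n = len(xs)
    lys = [math.log(y) for y in ys]
    mx = sum(xs) / n
    my = sum(lys) / n
    # Slope and intercept of the regression of ln(y) on x.
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (ly - my) for x, ly in zip(xs, lys))
    theta2 = sxy / sxx
    theta1 = math.exp(my - theta2 * mx)
    return theta1, theta2
```

Applied to the columns of Table 1 (with x on the appropriate scale), this yields the theta1/theta2 pairs compared in the text.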
Table 2 shows the results obtained in the simulations conducted in an
environment composed of 7 heterogeneous computers. The simulated situation is the
same as in the homogeneous environment: the Poisson probability distribution
function related to the process arrivals has a mean of 100 seconds; the capacity
of each computer in the system is 368 cycles per millisecond; 320 processes were
submitted to the system; and a subnet of 3 computers was taken into account so
that the modified Lowest could find the idlest computer.
TABLE 2 – Modified Lowest and TLBA response times for different process occupations in a heterogeneous system

Occupation per Process (millions of cycles) | Modified Lowest response time, heterogeneous environment (ms) | TLBA response time, heterogeneous environment (ms)
10 | 874367.01 | 834312.15
100 | 8929430.36 | 8626009.67
1000 | 89918432.93 | 86541094.87
10000 | 897244022.45 | 865692066.19
100000 | 8969711464.11 | 8657202652.08
1000000 | 90185968862.68 | 86572308609.92
10000000 | 902407938340.81 | 865723367688.47
100000000 | 8981001449908.22 | 8657233958852.14
1000000000 | 89777274689289.90 | 86572339868669.00
10000000000 | 898600800041058.00 | 865723398968154.00
100000000000 | 8955454881991950.00 | 8657233989964290.00
1000000000000 | 89532098573832300.00 | 86572339899923800.00
The exponential equations that express the behavior of the modified Lowest and
TLBA algorithms are shown in figures 4 and 5, respectively.

y = 88,996 · e^(2.30x)
Figure 4 – Equation of the modified Lowest algorithm for the simulation in an
environment composed of 7 heterogeneous computers

y = 85,415 · e^(2.30x)
Figure 5 – Equation of the TLBA for the simulation in an environment composed of
7 heterogeneous computers

From the exponential equations of figures 4 and 5, obtained from the simulations
of the Lowest and TLBA algorithms respectively, it may be concluded that the
theta1 of the Lowest equation is higher, which implies longer response times.
Therefore, the TLBA also behaves better in heterogeneous environments, showing
shorter response times.
The exponential equations obtained from the simulations in environments composed
of 7 computers may be used to forecast the algorithms' behavior under different
loads. The equations may be applied to different process occupation values
(x axis), resulting in response times (y axis). In all cases, the TLBA generated
exponential equations with lower thetas, so it can be considered the best
algorithm for the presented cases.
B) Comparing the response times of Lowest and TLBA in an environment composed of 781 computers
The following simulations compare the modified Lowest to the TLBA in a
homogeneous environment composed of 781 computers. In these simulations, 35,145
processes were submitted to the environment, based on the Poisson probability
distribution function with a mean of 100 seconds. Each computer's capacity is 368
cycles/ms. For the Lowest algorithm, a subnet of eleven computers was considered;
these were selected on a random basis and their loads were analyzed to define the
idlest computer. Table 3 presents the results obtained in the simulations.
TABLE 3 – Response times related to the occupation per process of the modified Lowest and TLBA algorithms

Occupation per Process (millions of cycles) | Modified Lowest response time, homogeneous environment (ms) | TLBA response time, homogeneous environment (ms)
10000 | 621599956.08 | 621564741.84
100000 | 6246908556.10 | 6246573454.81
1000000 | 62500016636.37 | 62496667285.42
10000000 | 625031931013.69 | 624997605085.41
100000000 | 6250345430262.57 | 6250006980417.56
1000000000 | 62503491403134.80 | 62500100730700.10
10000000000 | 625034828968439.00 | 625001038228383.00
100000000000 | 6250347547420140.00 | 6250010413228300.00
1000000000000 | 62503488649116800.00 | 62500104163224200.00
By doing the exponential regressions on the data obtained and presented on table
3, the exponential equations were generated and they express the behavior of the
modified Lowest algorithm (figure 6) and TLBA (figure 7).
y = 62,337,175.42 e 2.30x
Figure 6 – Equation of modified Lowest algorithm on the simulation at an environment composed
of 781 homogeneous computers
y = 62,333,740.06 e^(2.30x)
Figure 7 – Equation of TLBA algorithm on the simulation at an environment composed of 781
homogeneous computers
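Exponential regressions of the form y = theta0 · e^(theta1·x), as used above, reduce to an ordinary linear least-squares fit on ln(y). The sketch below uses synthetic data rather than the paper's measurements; the function name is an illustrative choice.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by linear least squares on ln(y)."""
    n = len(xs)
    lys = [math.log(y) for y in ys]
    mx = sum(xs) / n
    my = sum(lys) / n
    b = sum((x - mx) * (ly - my) for x, ly in zip(xs, lys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic check: data generated from y = 3 * exp(0.5 * x)
xs = [0, 1, 2, 3, 4]
ys = [3 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
```

Applied to exact exponential data, the fit recovers the coefficients; applied to measured response times, it yields the theta0 (multiplier) and theta1 (exponent) values compared throughout this section.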
Analyzing the resulting equations (figures 6 and 7), we may note that the theta1
for the TLBA exponential equation is lower than the one for the modified
Lowest, which implies shorter response times when using TLBA load balancing.
The same environment composed of 781 computers was simulated again, this
time with computers of heterogeneous capacities: 391 of them had a 368
cycles/ms capacity and 390 of them had a 139 cycles/ms capacity. The results
obtained in these simulations are shown in table 4.
TABLE 4 – Response times obtained at different process occupations for the modified Lowest and TLBA algorithms

Occupation per process    Modified Lowest response time,    TLBA response time,
(millions of cycles)      heterogeneous environment (ms)    heterogeneous environment (ms)
10000                     903520989.32                      903494690.55
100000                    9066120728.99                     9065872768.52
1000000                   90692197539.30                    90689656980.88
10000000                  906952556067.72                   906927490948.86
100000000                 9069556015328.95                  9069305844417.15
1000000000                90695569517117.60                 90693089372139.10
10000000000               906956131430007.00                906930924662546.00
100000000000              9069558757488640.00               9069309277553760.00
1000000000000             90695602819271600.00              90693092806468100.00
From the data shown in table 4, regressions were computed for the modified
Lowest algorithm and the TLBA, yielding the exponential equations that express
the behavior of both algorithms. Figures 8 and 9 show, respectively, the
equations for the modified Lowest algorithm and the TLBA.
y = 90,529,356.61 e^(2.30x)
Figure 8 – Equation of modified Lowest algorithm on the simulation at an environment composed
of 781 heterogeneous computers
y = 59,753.94 e^(2.30x)
Figure 9 – Equation of TLBA algorithm on the simulation at an environment composed of 781
heterogeneous computers
Observing the equations of figures 8 and 9, it may be noted that the theta1
related to the Lowest algorithm is much higher than that of the TLBA, which
confirms the worse response times of the modified Lowest. Although Lowest
shows response times close to those presented by the TLBA, tables 3 and 4
show that these times become much worse as the environment load increases.
C) Simulations that analyze the number of messages
generated for the TLBA and modified Lowest algorithms
Simulations were carried out to analyze the traffic of messages exchanged among
the computers when using the TLBA and modified Lowest algorithms. The
simulated environment is composed of 7 computers segmented into three levels
with two computers on each level. In the modified Lowest case, the subset of
computers analyzed before the load balancing operation is composed of 3
computers. The number of processes arriving at the system varies from 10 to
350. Process arrivals follow a Poisson probability distribution function with
average 3 seconds. Each computer's capacity is 368 cycles per millisecond.
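The arrival pattern used throughout these simulations, a Poisson process with a given average, can be generated from exponential inter-arrival gaps. This is a generic sketch of that setup, not the authors' simulator; the function name and the fixed seed are illustrative choices.

```python
import random

def arrival_times(n, mean_interarrival=3.0, seed=42):
    """Generate n arrival instants of a Poisson arrival process:
    inter-arrival gaps are exponentially distributed with the
    given mean (in seconds)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(1.0 / mean_interarrival)
        times.append(t)
    return times
```

Fixing the seed makes a simulation run reproducible, which is what allows response times and message counts to be compared across algorithm variants under identical workloads.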
The graph in figure 10 shows the number of messages exchanged among the
computers for the Lowest algorithm and among the TLBA levels.
FIGURE 10 – Number of messages exchanged among the computers for the Lowest and TLBA algorithms, in an environment composed of 7 homogeneous computers (x axis: number of processes, from 10 to 350; y axis: messages per level, from 0 to 3,000; series: TLBA, Lowest, and TLBA at a single network)
Due to the TLBA tree format, it is possible to segment each level into a
physically distinct network, which causes less overload on the communication
means. Figure 10 shows three simulated situations: the first using the TLBA
with a physically distinct network for each level; the modified Lowest; and the
TLBA on a physically single network.
The TLBA generates fewer messages per level than the modified Lowest. The
growth rate of the number of messages that pass through the communication
network is much higher for Lowest. Even when using the TLBA on a single
network, there is a significant decrease in the number of messages compared to
the Lowest.
Both the TLBA and the modified Lowest show variations in the number of
messages depending on the number of executed processes. Lowest generates
messages based on the subnet of computers analyzed to initiate a load balancing
operation, whilst the TLBA generates a depth search message over the tree to
find the best computer to execute the process. The TLBA also generates
messages updating the environment occupation; these messages run from the
leaves to the root of the virtual tree.
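The two kinds of TLBA messages described above — a depth search down the tree for the idlest computer, and occupation updates flowing from the leaves to the root — can be sketched on a small virtual tree. The node layout and field names below are assumptions for illustration, not the paper's data structures.

```python
class Node:
    def __init__(self, load, children=()):
        self.load = load              # this computer's occupation
        self.children = list(children)

def deepest_idlest(node):
    """Depth search over the tree: return (load, node) of the idlest computer."""
    best = (node.load, node)
    for child in node.children:
        cand = deepest_idlest(child)
        if cand[0] < best[0]:
            best = cand
    return best

def propagate_updates(node):
    """Occupation updates run from the leaves to the root: return
    (total subtree load, number of update messages sent)."""
    total, msgs = node.load, 0
    for child in node.children:
        sub, m = propagate_updates(child)
        total += sub
        msgs += m + 1   # one child-to-parent update message per link
    return total, msgs

# A 7-computer tree with one root, two mid-level nodes, four leaves
root = Node(5, [Node(3, [Node(7), Node(2)]),
                Node(4, [Node(6), Node(1)])])
```

On this tree the update pass sends one message per parent-child link (six in total), which matches the intuition that update traffic scales with the number of edges rather than with the number of computer pairs.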
TABLE 5 – Number of messages observed at 350 processes in the environment

Modified Lowest    TLBA        TLBA at a single network
2450               283.3333    850
Analyzing table 5, we may conclude that the number of messages for the TLBA
is 8.647 times lower than for the modified Lowest. The TLBA on a single
network generates 2.88235 times fewer messages than the modified Lowest. In
both situations the TLBA causes less overload on the communication means.
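The ratios quoted above follow directly from the message counts in table 5:

```python
# Message counts from table 5 (350 processes)
lowest, tlba, tlba_single = 2450, 283.3333, 850

# TLBA with per-level networks vs. modified Lowest
ratio_tlba = lowest / tlba            # about 8.647
# TLBA on a single network vs. modified Lowest
ratio_single = lowest / tlba_single   # about 2.88235
```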
D) Simulations to analyze the number of messages that
pass through the communication network for the modified
Lowest and TLBA algorithms based on the number of
computers per level
Simulations were conducted to analyze the impact caused by varying the
number of levels and the number of computers per level that make up the TLBA
tree. The environment used in these simulations contained 781 homogeneous
computers with a capacity of 368 cycles per millisecond. 3,000 processes were
submitted to the environment based on the Poisson probability distribution
function with average 3 seconds. Three situations were simulated for the TLBA:
1) the first had two levels, with one computer at the first level and 780 at the
second one;
2) the second had five levels, with one computer at the first level and five
computers on each of the remaining ones;
3) the third had 781 levels, with one computer per level.
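If "five computers" in situation 2 is read as the tree's branching factor — an interpretation on our part, suggested by the fact that 781 = 1 + 5 + 25 + 125 + 625 — then all three configurations cover exactly the 781 computers:

```python
def tree_size(branching, levels):
    """Number of computers in a tree with the given branching factor
    and number of levels: sum of branching**i for i in 0..levels-1."""
    return sum(branching ** i for i in range(levels))

# Situation 1: two levels, 1 root + 780 children
# Situation 2: five levels with branching factor 5 (our interpretation)
# Situation 3: a chain of 781 levels, one computer each
```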
Messages per level for different numbers of computers per level:

Computers per level       780         5          1
Lowest (messages)         69000.00    69000.00   69000.00
Tree/TLBA (messages)      5996.00     182.36     566133.00

FIGURE 11 – Messages exchanged among computers, segmented by level
Figure 11 shows the results obtained for the TLBA and the modified Lowest
algorithm. We may note that Lowest generates a constant number of messages,
equal to 69,000, whilst the number of messages for the TLBA varies according
to the number of levels and computers per level: about 182.36 for 5 levels,
5,996 for 2 levels, and 566,133 for 781 levels.
Choosing an intermediate number of levels generates lower traffic on the
communication means, which increases the system scalability. Using one
computer per level yields no interesting results, because the communication
among the computers becomes overloaded, which decreases the final efficiency
of the system and makes it not scalable. The overload happens due to the
number of levels traversed to find the best computer in the system to execute a
received process, and to update the information that propagates toward the root
computer.
The definition of an intermediate number of levels allows the creation of
subnets, each formed by a number of computers. Segmenting the system into
levels composed of physically distinct networks generates lower traffic on the
communication means, as most of the messages are exchanged among the
computers of a given level and therefore transit only on that level's physical
network. Creating a subnet requires one switch per level. For instance, a switch
is used to interconnect the computer placed at a level N to the ones of the level
N-1. A computer of the level N-1 needs to have two network interface cards,
one to connect to level N through the switch and another to connect to level
N-2. However, to create different subnets, the system administrator must define
the tree structure as static, and therefore cannot use the broker as implemented
so far (section III.E).
V. Comparative Evaluation of the Results Obtained
on the Executions of a TLBA Prototype and
Simulator
In order to confirm the TLBA's feasibility, a prototype was implemented using
GNU/GCC, a free software compiler available for several platforms [23], with
the Linux operating system as the test environment. This section compares the
results obtained from the executions of this prototype to the simulation results.
The prototype was submitted to two environment configurations: a homogeneous
one and a heterogeneous one. The homogeneous environment was composed of
seven computers with a 368 cycles/ms capacity. The heterogeneous environment
was composed of 2 computers with a 139 cycles/ms capacity and 5 computers
with a 368 cycles/ms capacity. The simulated and real environments used the
same parameters, which permits comparing both.
Table 6 presents the results obtained from the prototype executions and the
simulation results for homogeneous environments. The executions were carried
out with processes that occupy 10,330,000 cycles, in three distinct situations. In
the first case, processes that occupied only the CPU, without performing context
switches, were submitted. In the second case, processes that occupied the CPU
and made calls to the kernel's sleep function were submitted, totaling 512,500
milliseconds of sleep per process. In the third and last case, processes with an
accumulated sleep time of 1,024,000 milliseconds each were submitted. These
sleep times were defined so that the processes would take longer to complete,
which allowed the whole system to have more processes executing
simultaneously. In all cases, 350 processes were submitted to the environment
using a Poisson probability distribution function for process arrival with average
3 seconds. Two computers per level were defined to make up the
communication tree among the TLBA modules, as shown in figure 12. All
computers were connected to a 100 Mbit/s switch.
Figure 12 – Connecting tree created by TLBA
The number and occupation of the processes submitted to the environment were
defined to obtain a TLBA analysis for high-load environments. The three
analyzed cases showed the results in situations where there are load peaks;
situations where the high load is maintained for an average period of time; and
situations where the load is maintained for long periods of time.
TABLE 6 – Response times at the homogeneous environment (using the prototype and the
simulator)

Cycles                          Simulator (ms)    Prototype, actual environment (ms)    Variation
10330000                        698,125.6005      697,645                               0.00068889
10330000 + 512500 ms sleep      1,210,625.6       1,194,509                             0.013492239
10330000 + 1024000 ms sleep     1,722,125.6       1,760,153                             0.022081664
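The "Variation" column appears to be the relative difference between the simulated and measured values, taken against the smaller of the two; this formula is inferred from the published figures rather than stated in the text, so the sketch below is our reconstruction:

```python
def variation(simulated, actual):
    """Relative deviation between simulated and actual values,
    taken against the smaller of the two (formula inferred from
    the published variation figures)."""
    lo, hi = sorted((simulated, actual))
    return hi / lo - 1.0

# Second and third rows of table 6
v2 = variation(1210625.6, 1194509)    # about 0.013492
v3 = variation(1722125.6, 1760153)    # about 0.022082
```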
(The tree shown in figure 12 has one computer at level 1, two computers at level 2, and four computers at level 3.)
FIGURE 13 – Response times for the prototype executions and simulations at a homogeneous environment (x axis: execution types — 10330000, 10330000 + sleep500 * 1024, 10330000 + sleep500 * 2048; y axis: response time in milliseconds, from 0 to 2,000,000; series: Simulation and Actual Execution)
Observing table 6, a small variation between the actual and simulated values
may be noted, ranging from 0.068889% to 2.2081664%. The low variance of the
results validates the use of the simulator for the feasibility analysis of creating
load balancing environments with the TLBA. The graph in figure 13 shows
these results.
Table 7 shows the average number of load balancing messages exchanged in the
actual environment, represented by the prototype, and in the simulation
executions. The variance in the number of messages between the actual and
simulated systems is 5.8823529%, a figure that also confirms the simulator can
be used to forecast the behavior of an environment that intends to use the
TLBA. Table 7 shows one column for the number of messages exchanged
among the levels and another for the total messages exchanged in the
environment. If a single network is used, the number of generated messages is
found in the "Total" column; if physically distinct networks are used, the
average number of messages is found in the "Per level" column.
TABLE 7 – Messages passing through the network for the actual and simulated environments

                                  Per level       Total
Simulator                         283.3333333     850
Prototype (actual environment)    300             900
Variation                         0.058823529     0.058823529
The actual executions and simulations in the heterogeneous environment
followed the same pattern as in the homogeneous environment. The
heterogeneous environment was composed of 5 computers with a 368 cycles/ms
capacity and 2 computers with a 139 cycles/ms capacity.
Table 8 shows the results obtained from the executions of the TLBA prototype
and simulator in this heterogeneous environment. The processes were submitted
in the same three situations previously defined for the homogeneous
environment. In each test, 350 processes were submitted to the environment
using a Poisson probability distribution function with average 3 seconds. It may
be noted that the variation between the simulated and actual response times was
from 1 to 4%. The graph in figure 14 shows these results.
TABLE 8 – Response times at the heterogeneous environment (using the prototype and the
simulator)

Cycles                          Simulator (ms)    Prototype, actual environment (ms)    Variation
10330000                        781,348.8903      750,352.93                            0.04
10330000 + 512500 ms sleep      1,293,848.89      1,279,203.00                          0.01
10330000 + 1024000 ms sleep     1,805,348.89      1,742,498.00                          0.04
FIGURE 14 – Response times for the prototype executions and simulations at a heterogeneous environment (x axis: execution types — 10330000, 10330000 + sleep500 * 1024, 10330000 + sleep500 * 2048; y axis: response time in milliseconds, from 0 to 2,000,000; series: Simulation and Actual Execution)
Table 9 shows the number of messages generated both in the actual environment
(prototype) and in the simulated one. It may be noted that the variation between
the expected and actual number of messages is about 7%.
TABLE 9 – Messages passing through the network for the actual and simulated environments

                                  Per level       Total
Simulator                         257             771
Prototype (actual environment)    275             825
Variation                         0.070038911     0.070038911
As may be observed, the values obtained in the simulations were later confirmed
by the prototype executions, which validates the performed simulations. It
should also be highlighted that all actual executions were conducted with the
aim of overloading the system, so that the comparisons between the prototype
(actual system) and the simulator could be made under extreme situations. In
these situations, the low variation in the number of messages exchanged to
perform the load balancing operations and in the process response times was
confirmed.
VI. Conclusions
The paper proposes and analyzes the TLBA algorithm, which creates a virtual
connecting tree to update the information on the system occupation. This tree is
used to carry out depth searches that find the idlest computer, the one that
should execute a process received by the system at a given instant. Analyzing
the simulations presented in section IV, it was concluded that the TLBA
algorithm decreases the number of messages generated in a given environment
and yields shorter response times than the modified Lowest algorithm [10].
In order to confirm the results obtained in the simulations, a prototype was
implemented on the Linux operating system. This prototype was submitted to
tests whose results confirmed the low variation between the simulated and
actual (prototype) environments in the response times and in the number of
messages exchanged during the load balancing operations.
The low variances between the TLBA prototype and simulator confirmed the
simulator's usefulness for forecasting the behavior of environments that use the
TLBA for load balancing and, consequently, validated the comparisons made
between the TLBA and the modified Lowest algorithms (section IV), in which
the TLBA showed the best results.
References
[1] KRUEGER, P. and LIVNY, M., "The Diverse Objectives of Distributed Scheduling Policies", Proc.
Seventh Int'l Conf. Distributed Computing Systems, IEEE CS Press, Los Alamitos, Calif., Order Nr
801 (microfiche only), 1987, pp. 242-249.
[2] SHIVARATRI, N. G., KRUEGER, P. and SINGHAL, M., “Load Distributing for Locally Distributed
Systems”, IEEE Computer, December 1992, pp. 33-44.
[3] SINHA, P. K., “Distributed Operating Systems: Concepts and Design” , John Wiley & Sons, 1997.
[4] ZHOU, S. and FERRARI, D., "An Experimental Study of Load Balancing Performance", Report Nr
UCB/CSD 87/336, January 1987, PROGRES Report Nr 86.8, Computer Science Division (EECS),
University of California, Berkeley, California 94720.
[5] THEIMER, M. M. and LANTZ, K. A., “Finding Idle Machines in a Workstation-Based Distributed
Systems”, In Proc. IEEE 8th Int. Conf. On Distributed Computing Systems, pp. 112-122, 1988.
[6] MELLO, R. F., PAIVA, M. S., TREVELIN, L. C., GONZAGA, A., “Model for Resources Load
Evaluation in Clusters by Using a Peer-to-Peer Protocol” , The 2002 International Conference on
Parallel and Distributed Processing Techniques and Applications, Las Vegas, USA, 24 - 27 June 2002.
[7] MELLO, R. F., PAIVA, M. S., TREVELIN, L. C., “OpenTella: A Peer-to-Peer protocol for the load
balancing in a system formed by a cluster from clusters” , 4th International Symposium on High
Performance Computing ISHPC-IV, Japan, May 2002.
[8] MELLO, R. F., PAIVA, M. S., TREVELIN, L. C., "Analysis on the Significant Information to Update
the Tables on Occupation of Resources by Using a Peer-to-Peer Protocol", 16th Annual International
Symposium on High Performance Computing Systems and Applications, 17-19 June 2002, Moncton,
New Brunswick, Canada.
[9] MELLO, R. F., PAIVA, M. S., TREVELIN, L. C., “A Java Cluster Management Service”, IEEE
International Symposium on Network Computing and Applications, February 2002, Cambridge,
Massachusetts, USA.
[10] MELLO, R. F., TREVELIN, L. C., PAIVA, M. S., “Comparative Study of the Server-Initiated Lowest
Algorithm using a Load Balancing Index Based on the Process Behavior for Heterogeneous
Environment” , Submitted to the Kluwer The Journal of Networks, Software Tools and Applications,
August 2003.
[11] HUI, C.; CHANSON, S.T. (1999). Hydrodynamic Load Balancing. IEEE Transactions on Parallel
and Distributed Systems, v. 10, n.11, p. 1118-1137, November 1999.
[12] MITZENMACHER, M. (2000). How Useful Is Old Information? IEEE Transactions on Parallel and
Distributed Systems, v. 11, n.1, pp. 6-20, January 2000.
[13] ANDERSON, J.R.; ABRAHAM, S. (2000). Performance-Based Constraints for Multidimensional
Networks. IEEE Transactions on Parallel and Distributed Systems, v. 11, n.1, p. 21-35, January 2000.
[14] AMIR, Y. et al. (2000). An Opportunity Cost Approach for Job Assignment in a Scalable Computing
Cluster. IEEE Transactions on Parallel and Distributed Systems, v. 11, n.7, p. 760-768, July 2000.
[15] KOSTIN, A.E.; AYBAY, I.; OZ, G. (2000). A Randomized Contention-Based Load-Balancing
Protocol for a Distributed Multiserver Queuing System. IEEE Transactions on Parallel and
Distributed Systems, v. 11, n.12, p. 1252-1273, December 2000.
[16] FIGUEIRA, S.M.; BERMAN, F. (2001). A Slowdown Model for Applications Executing on Time-
Shared Clusters of Workstations. IEEE Transactions on Parallel and Distributed Systems, v. 12, n.6,
p. 653-670, June 2001.
[17] ZOMAYA, A.Y.; TEH, Y. (2001). Observations on Using Genetic Algorithms for Dynamic Load-
Balancing. IEEE Transactions on Parallel and Distributed Systems, v. 12, n.9, p. 899-911, September
2001.
[18] ZHANG, Y. (2001). Impact of Workload and System Parameters on Next Generation Cluster
Scheduling Mechanisms. IEEE Transactions on Parallel and Distributed Systems, v. 12, n.9, p. 967-
985, September 2001.
[19] MITZENMACHER, M. (2001). The Power of Two Choices in Randomized Load Balancing. IEEE
Transactions on Parallel and Distributed Systems, v. 12, n.10, p. 1094-1104, October 2001.
[20] WOODSIDE, C.M.; MONFORTON, G.G. (1993). Fast Allocation of Processes in Distributed and
Parallel Systems. IEEE Transactions on Parallel and Distributed Systems, v. 4, n.2, p. 164-174,
February 1993.
[21] PAGE, I; JACOB, T.; CHERN, E. (1993). Fast Algorithms for Distributed Resource Allocation. IEEE
Transactions on Parallel and Distributed Systems, v. 4, n.2, p. 188-197, February 1993.
[22] HSU, T. et al. (2000). Task Allocation on a Network of Processors. IEEE Transactions on Parallel
and Distributed Systems, v.49, n.12, p.1339-1353, December 2000.
[23] GNU/GCC Supported Platforms. URL: http://gcc.gnu.org/install/specific.html. Consulted on June 25, 2003.