
Large-Scale Computing

WILEY SERIES ON PARALLEL AND DISTRIBUTED COMPUTING

    Editor: Albert Y. Zomaya

    A complete list of titles in this series appears at the end of this volume.

Copyright © 2012 by John Wiley & Sons, Inc. All rights reserved.

    Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

    No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

    Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

    For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

    Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

    Library of Congress Cataloging-in-Publication Data:

Large-scale computing techniques for complex system simulations / [edited by] Werner Dubitzky, Krzysztof Kurowski, Bernhard Schott.
    p. cm.—(Wiley series on parallel and distributed computing ; 80)
    Includes bibliographical references and index.
    ISBN 978-0-470-59244-1 (hardback)
    1. Computer simulation. I. Dubitzky, Werner, 1958– II. Kurowski, Krzysztof, 1977– III. Schott, Bernhard, 1962–
    QA76.9.C65L37 2012
    003'.3–dc23
    2011016571

eISBN 9781118130476; oISBN 9781118130506; ePub 9781118130490; MOBI 9781118130483

Printed in the United States of America.

    10 9 8 7 6 5 4 3 2 1


Contents

    FOREWORD XI

    PREFACE XV

    CONTRIBUTORS XIX

1. State-of-the-Art Technologies for Large-Scale Computing 1
    Florian Feldhaus, Stefan Freitag, and Chaker El Amrani

    1.1 Introduction / 1
    1.2 Grid Computing / 2
    1.3 Virtualization / 6
    1.4 Cloud Computing / 8
    1.4.1 Drawbacks of Cloud Computing / 9
    1.4.2 Cloud Interfaces / 10
    1.5 Grid and Cloud: Two Complementary Technologies / 12
    1.6 Modeling and Simulation of Grid and Cloud Computing / 13
    1.6.1 GridSim and CloudSim Toolkits / 14
    1.7 Summary and Outlook / 15
    References / 16

2. The e-Infrastructure Ecosystem: Providing Local Support to Global Science 19
    Erwin Laure and Åke Edlund

    2.1 The Worldwide e-Infrastructure Landscape / 19
    2.2 BalticGrid: A Regional e-Infrastructure, Leveraging on the Global “Mothership” EGEE / 21
    2.2.1 The BalticGrid Infrastructure / 21
    2.2.2 BalticGrid Applications: Providing Local Support to Global Science / 22
    2.2.3 The Pilot Applications / 23
    2.2.4 BalticGrid’s Support Model / 25
    2.3 The EGEE Infrastructure / 25
    2.3.1 The EGEE Production Service / 26
    2.3.2 EGEE and BalticGrid: e-Infrastructures in Symbiosis / 28
    2.4 Industry and e-Infrastructures: The Baltic Example / 29
    2.4.1 Industry and Grids / 29
    2.4.2 Industry and Clouds, Clouds and e-Infrastructures / 30
    2.4.3 Clouds: A New Way to Attract SMEs and Start-Ups / 30
    2.5 The Future of European e-Infrastructures: The European Grid Initiative (EGI) and the Partnership for Advanced Computing in Europe (PRACE) Infrastructures / 31
    2.5.1 Layers of the Ecosystem / 32
    2.6 Summary / 33
    Acknowledgments / 34
    References / 34

3. Accelerated Many-Core GPU Computing for Physics and Astrophysics on Three Continents 35
    Rainer Spurzem, Peter Berczik, Ingo Berentzen, Wei Ge, Xiaowei Wang, Hsi-Yu Schive, Keigo Nitadori, Tsuyoshi Hamada, and José Fiestas

    3.1 Introduction / 36
    3.2 Astrophysical Application for Star Clusters and Galactic Nuclei / 38
    3.3 Hardware / 40
    3.4 Software / 41
    3.5 Results of Benchmarks / 42
    3.6 Adaptive Mesh Refinement Hydrosimulations / 49
    3.7 Physical Multiscale Discrete Simulation at IPE / 49
    3.8 Discussion and Conclusions / 53
    Acknowledgments / 54
    References / 54

4. An Overview of the SimWorld Agent-Based Grid Experimentation System 59
    Matthias Scheutz and Jack J. Harris

    4.1 Introduction / 59
    4.2 System Architecture / 62
    4.3 System Implementation / 67
    4.3.1 Key Components / 68
    4.3.2 Novel Features in SWAGES / 69
    4.4 A SWAGES Case Study / 71
    4.4.1 Research Questions and Simulation Model / 71
    4.4.2 The Simulation Environment / 72
    4.4.3 Simulation Runs in SWAGES / 72
    4.4.4 Data Management and Visualization / 73
    4.5 Discussion / 74
    4.5.1 Automatic Parallelization of Agent-Based Models / 75
    4.5.2 Integrated Data Management / 76
    4.5.3 Automatic Error Detection and Recovery / 76
    4.5.4 SWAGES Compared to Other Frameworks / 76
    4.6 Conclusions / 78
    References / 78

5. Repast HPC: A Platform for Large-Scale Agent-Based Modeling 81
    Nicholson Collier and Michael North

    5.1 Introduction / 81
    5.2 Agent Simulation / 82
    5.3 Motivation and Related Work / 82
    5.4 From Repast S to Repast HPC / 90
    5.4.1 Agents as Objects / 91
    5.4.2 Scheduling / 91
    5.4.3 Modeling / 91
    5.5 Parallelism / 92
    5.6 Implementation / 94
    5.6.1 Context / 95
    5.6.2 RepastProcess / 95
    5.6.3 Scheduler / 96
    5.6.4 Distributed Network / 97
    5.6.5 Distributed Grid / 98
    5.6.6 Data Collection and Logging / 99
    5.6.7 Random Number Generation and Properties / 100
    5.7 Example Application: Rumor Spreading / 101
    5.7.1 Performance Results / 103
    5.8 Summary and Future Work / 107
    References / 107

6. Building and Running Collaborative Distributed Multiscale Applications 111
    Katarzyna Rycerz and Marian Bubak

    6.1 Introduction / 111
    6.2 Requirements of Multiscale Simulations / 112
    6.2.1 Interactions between Single-Scale Models / 113
    6.2.2 Interoperability, Composability, and Reuse of Simulation Models / 115
    6.3 Available Technologies / 116
    6.3.1 Tools for Multiscale Simulation Development / 116
    6.3.2 Support for Composability / 117
    6.3.3 Support for Simulation Sharing / 118
    6.4 An Environment Supporting the HLA Component Model / 119
    6.4.1 Architecture of the CompoHLA Environment / 119
    6.4.2 Interactions within the CompoHLA Environment / 120
    6.4.3 HLA Components / 122
    6.4.4 CompoHLA Component Users / 124
    6.5 Case Study with the MUSE Application / 124
    6.6 Summary and Future Work / 127
    Acknowledgments / 128
    References / 129

7. Large-Scale Data-Intensive Computing 131
    Mark Parsons

    7.1 Digital Data: Challenge and Opportunity / 131
    7.1.1 The Challenge / 131
    7.1.2 The Opportunity / 132
    7.2 Data-Intensive Computers / 132
    7.3 Advanced Software Tools and Techniques / 134
    7.3.1 Data Mining and Data Integration / 134
    7.3.2 Making Data Mining Easier / 135
    7.3.3 The ADMIRE Workbench / 137
    7.4 Conclusion / 139
    Acknowledgments / 139
    References / 139

8. A Topology-Aware Evolutionary Algorithm for Reverse-Engineering Gene Regulatory Networks 141
    Martin Swain, Camille Coti, Johannes Mandel, and Werner Dubitzky

    8.1 Introduction / 141
    8.2 Methodology / 143
    8.2.1 Modeling GRNs / 143
    8.2.2 QCG-OMPI / 148
    8.2.3 A Topology-Aware Evolutionary Algorithm / 152
    8.3 Results and Discussion / 155
    8.3.1 Scaling and Speedup of the Topology-Aware Evolutionary Algorithm / 155
    8.3.2 Reverse-Engineering Results / 158
    8.4 Conclusions / 160
    Acknowledgments / 161
    References / 161

9. QosCosGrid e-Science Infrastructure for Large-Scale Complex System Simulations 163
    Krzysztof Kurowski, Bartosz Bosak, Piotr Grabowski, Mariusz Mamonski, Tomasz Piontek, George Kampis, László Gulyás, Camille Coti, Thomas Herault, and Franck Cappello

    9.1 Introduction / 163
    9.2 Distributed and Parallel Simulations / 165
    9.3 Programming and Execution Environments / 168
    9.3.1 QCG-OMPI / 169
    9.3.2 QCG-ProActive / 171
    9.4 QCG Middleware / 174
    9.4.1 QCG-Computing Service / 175
    9.4.2 QCG-Notification and Data Movement Services / 176
    9.4.3 QCG-Broker Service / 177
    9.5 Additional QCG Tools / 179
    9.5.1 Eclipse Parallel Tools Platform (PTP) for QCG / 179
    9.6 QosCosGrid Science Gateways / 180
    9.7 Discussion and Related Work / 182
    References / 184

    GLOSSARY 187

    INDEX 195

Foreword

The human desire to understand things is insatiable and perpetually drives the advance of society, industry, and our quality of living. Scientific inquiry and research are at the core of this desire to understand, and, over the years, scientists have developed a number of formal means of doing that research. The third means of scientific research, developed in the past 100 years after theory and experimentation, is modeling and simulation. The sharpest tool invented thus far for modeling and simulation has been the computer. It is fascinating to observe that, within the short span of a few decades, humans have developed dramatically larger and larger scale computing systems, increasingly based on massive replication of commodity elements, to model and simulate complex phenomena in increasing detail, thus delivering greater insights. At this juncture in computing, it looks as if we have managed to turn everything physical into its virtual equivalent, letting a virtual world precede the physical one it reflects so that we can predict what is going to happen before it happens. Better yet, we want to use modeling and simulation to tell us how to change what is going to happen. The thought is almost scary as we are not supposed to do God’s work.

We all know the early studies of ballistic trajectories and code breaking, which stimulated the development of the first computers. From those beginnings, all kinds of physical objects and natural phenomena have been captured and reflected in computer models, ranging from particles in a “simple” atom to the creation of the universe, with modeling the earth’s climate in between.



Computer-based modeling systems are not just limited to natural systems, but increasingly, man-made objects are being modeled, as well. One could say that, without computers, there would not have been the modern information, communication, and entertainment industries because the heart of these industries’ products, the electronic chips, must be extensively simulated and tested before they are manufactured.

Even much larger physical products, like cars and airplanes, are also modeled by computers before they go into manufacturing. Airbus virtually assembles all of its planes’ graphically rendered parts every night to make sure they fit together and work, just like a software development project has all its pieces of code compiled and regression tested every night so that the developers can get reports on what they did wrong the day before to fix their problems. Lately, modelers have advanced to simulating even the human body itself, as well as the organizations we create: How do drugs interact with the proteins in our bodies? How can a business operate more efficiently to generate more revenue and profits by optimizing its business processes? The hottest area of enterprise computing applications nowadays is business analytics.

Tough problems require sharp tools, and imaginative models require innovative computing systems. In computing, the rallying cry from the beginning has been “larger is better”: larger computing power, larger memory, larger storage, larger everything. Our desire for more computing capacity has been insatiable. To solve the problems that computer simulations address, computer scientists must be both dreamers and greedy for more computing power at the same time. And it is all for the better: It is valuable to be able to simulate in detail a car with a pregnant woman sitting in it and to model how the side and front air bags will function when surrounded by all the car parts and involved in all kinds of abnormal turns. More accurate results require using more computing power, more data storage space, and more interconnect bandwidth to link them all together. In computing, greed is good—if you can afford it.

That is where this book will start: How can we construct infinitely powerful computing systems with just the right applications to support modeling and simulating problems at a very low cost so they can be accessible to everybody? There is no simple answer, but there are a million attempts. Of course, not all of them lead to real progress, and not all of them can be described between the two covers of a book. This book carefully selected eight projects and enlisted their thinkers to show and tell what they did and what they learned from those experiences. The reality of computing is that it is still just as much an art as it is science: it is the cleverness of how to put together the silicon and the software programs to come up with a system that has the lowest cost while also being the most powerful and easiest to use. These thinkers, like generations of inventors before them, are smart, creative people on a mission.

There have certainly been many innovations in computer simulation systems, but three in particular stand out: commodity replication, virtualization, and cloud computing. Each of these will also be explored in this book, although none of these trends have been unique to simulation.


    Before commodity replication, the computing industry had a good run of complex proprietary systems, such as the vector supercomputers. But when it evolved to 30 layers in a printed circuit board and a cost that could break a regional bank, it had gone too far. After that, the power was in the replication of commodity components like the x86 chips used in PCs, employing a million of them while driving a big volume discount, then voilà, you get a large-scale system at a low(er) cost! That is what drove computing clusters and grids.

The complexity animal was then tamed via software through virtualization, which abstracts away the low-level details of the system components to present a general systems environment supporting a wide array of parallel programming environments and specialized simulations. Virtualization allows innovation in computer system components without changing applications to fit the systems. Finally, cloud computing may allow users not to have to actually own large simulation computers anymore but rather to only use computing resources as needed, sharing such a system with other users. Or even better, cloud resources can be rented from a service provider and users only pay for usage. That is an innovation all right, but it feels like we are back to the mainframe service bureau days. In this book, you will learn about many variations of innovations around these three themes as applied to simulation systems.

So, welcome to a fascinating journey into the virtual world of simulation and computing. Be prepared to be perplexed before possibly being enlightened. After all, simulation is supposed to be the state of the art when it comes to complex and large-scale computing. As they say, the rewards make the journey worthwhile.

Songnian Zhou
    Toronto, Canada

    March 2011

Preface

Complex systems are defined as systems with many interdependent parts that give rise to nonlinear and emergent properties determining their high-level functioning and behavior. Due to the interdependence of their constituent elements and other characteristics of complex systems, it is difficult to predict system behavior based on the “sum of their parts” alone. Examples of complex systems include human economies and societies, nervous systems, molecular interaction networks, cells and other living things, such as bees and their hives, and ecosystems, as well as modern energy and telecommunication infrastructures. Arguably, one of the most striking properties of complex systems is that conventional experimental and engineering approaches are inadequate to capture and predict the behavior of such systems. A relatively recent and more holistic approach employs computational techniques to model and simulate complex natural phenomena and complex man-made artifacts. Complex system simulations typically require considerable computing and storage resources (processing units and primary and secondary memory) as well as high-speed communication links. Supercomputers are the technology of choice to satisfy these requirements. Because supercomputers are expensive to acquire and maintain, there has been a trend to exploit distributed computing and other large-scale computing technologies to facilitate complex system simulations. Grid computing, service-oriented architectures, programmable logic arrays, and graphic processors are examples of such technologies.



The purpose of this volume is to present a representative overview of contemporary large-scale computing technologies in the context of complex system simulation applications. The book is intended to serve simultaneously as design blueprint, user guide, research agenda, and communication platform. As a design blueprint, the book is intended for researchers and technology and application developers, managers, and other professionals who are tasked with the development or deployment of large-scale computer technology to facilitate complex system applications. As a user guide, the volume addresses the requirements of modelers and scientists to gain an overview and a basic understanding of key concepts, methodologies, technologies, and tools. For this audience, we seek to explain the key concepts and assumptions of the various large-scale computer techniques and technologies, their conceptual and computational merits and limitations. We aim at providing the users with a clear understanding and practical know-how of the relevant technologies in the context of complex system modeling and simulation and the large-scale computing technologies employed to meet the requirements of such applications. As research agenda, the book is intended for computer and complex systems students, teachers, and researchers who seek to understand the state of the art of the large-scale computing technologies involved as well as their limitations and emerging and future developments. As a communication platform, the book is intended to bridge the cultural, conceptual, and technological gap among the key disciplines of complex system modeling and simulation and large-scale computing. To support this goal, we have asked the contributors to adopt an approach that appeals to audiences from different backgrounds.

Clearly, we cannot expect to do full justice to all of these goals in a single book. However, we do believe that this book has the potential to go a long way in fostering the understanding, development, and deployment of large-scale computer technology and its application to the modeling and simulation of complex systems. Thus, we hope this volume will contribute to increased communication and collaboration across various modeling, simulation, and computer science disciplines and will help to improve complex natural and engineering systems.

    This volume comprises nine chapters, which introduce the key concepts and challenges and the lessons learned from developing and deploying large-scale computing technologies in the context of complex system applications. Next, we briefly summarize the contents of the nine chapters.

Chapter 1 is concerned with an overview of some large-scale computing technologies. It discusses how in the last three decades the demand for computer-aided simulation of processes and systems has increased. In the same time period, simulation models have become increasingly complex in order to capture the details of the systems and processes being modeled. Both trends have instigated the development of new concepts aimed at a more efficient sharing of computational resources. Typically, grid and cloud computing techniques are employed to meet the computing and storage demands of complex applications in research, development, and other areas. This chapter provides


    an overview of grid and cloud computing, which are key elements of many modern large-scale computing environments.

Chapter 2 adopts the view of an e-infrastructure ecosystem. It focuses on scientific collaborations and how these are increasingly relying on the capability of combining computational and data resources supplied by several resource providers into seamless e-infrastructures. This chapter presents the rationale for building an e-infrastructure ecosystem that comprises national, regional, and international e-infrastructures. It discusses operational and usage models and highlights how e-infrastructures can be used in building complex applications.

Chapter 3 presents multiscale physical and astrophysical simulations on new many-core accelerator hardware. The chosen algorithms are deployed on parallel clusters using a large number of graphical processing units (GPUs) on the petaflop scale. The applications are particle-based astrophysical many-body simulations with self-gravity, as well as particle- and mesh-based simulations of fluid flows, from astrophysics and physics. Strong and soft scaling are demonstrated using some of the fastest GPU clusters in China and hardware resources of cooperating teams in Germany and the United States.

Chapter 4 presents an overview of the SimWorld Agent-Based Grid Experimentation System (SWAGES). SWAGES has been used extensively for various kinds of agent-based modeling and is designed to scale to very large and complex grid environments while maintaining a very simple user interface for integrating models with the system. This chapter focuses on SWAGES’ unique features for parallel simulation experimentation (such as novel spatial scheduling algorithms) and on its methodologies for utilizing large-scale computational resources (such as the distributed server architecture designed to offset the ever-growing computational demands of administering large simulation experiments).

Chapter 5 revolves around agent-based modeling and simulation (ABMS) technologies. In the last decade, ABMS has been successfully applied to a variety of domains, demonstrating the potential of this approach to advance science, engineering, and other domains. However, realizing the full potential of ABMS to generate breakthrough research results requires far greater computing capability than is available through current ABMS tools. The Repast for High Performance Computing (Repast HPC) project addresses this need by developing a next-generation ABMS system explicitly focusing on larger-scale distributed computing platforms. This chapter’s contribution is its detailed presentation of the implementation of Repast HPC, a complete ABMS platform developed explicitly for large-scale distributed computing systems.

Chapter 6 presents an environment for the development and execution of multiscale simulations composed from high-level architecture (HLA) components. Advanced HLA mechanisms are particularly useful for multiscale simulations as they provide, among others, time management functions that enable the construction of integrated simulations from modules with different individual timescales. Using the proposed solution simplifies the use of HLA


services and allows components to be steered by users; this is not possible in raw HLA. This solution separates the roles of simulation module developers from those of users and enables collaborative work. The environment is accessible via a scripting API, which enables the steering of distributed components using straightforward source code.

Chapter 7 is concerned with the data dimensions of large-scale computing. Data-intensive computing is the study of the tools and techniques required to manage and explore digital data. This chapter briefly discusses the many issues arising from the huge increase in stored digital data that we are now confronted with globally. In order to make sense of this data and to transform it into useful information that can inform our knowledge of the world around us, many new techniques in data handling, data exploration, and information creation are needed. The Advanced Data Mining and Integration Research for Europe (ADMIRE) project, which this chapter discusses in some detail, is studying how some of these challenges can be addressed through the creation of advanced, automated data mining techniques that can be applied to large-scale distributed data sets.

Chapter 8 describes a topology-aware evolutionary algorithm that is able to automatically adapt itself to different configurations of distributed computing resources. An important component of the algorithm is the use of QosCosGrid-OpenMPI, which enables the algorithm to run across computing resources hundreds of kilometers distant from one another. The authors use the evolutionary algorithm to compare models of a biological gene regulatory network that have been reverse engineered using three different systems of mathematical equations.

Chapter 9 presents a number of technologies that have been successfully integrated into a supercomputing-like e-science infrastructure called QosCosGrid (QCG). The solutions provide services for simulations of complex systems, multiphysics and hybrid models, and for parallel applications. The key aim in providing these solutions was to support the dynamic and guaranteed use of distributed computational clusters and supercomputers managed efficiently by a hierarchical scheduling structure involving a metascheduler layer and underlying local queuing or batch systems.

Werner Dubitzky
    Krzysztof Kurowski

    Bernhard Schott

Coleraine, Frankfurt, Poznan
    May 2011

Contributors

Peter Berczik, National Astronomical Observatories of China, Chinese Academy of Sciences, China, and Astronomisches Rechen-Institut, Zentrum für Astronomie, University of Heidelberg, Germany; Email: [email protected]

    Ingo Berentzen, Zentrum für Astronomie, University of Heidelberg, Heidelberg, Germany; Email: [email protected]

    Bartosz Bosak, Poznan Supercomputing and Networking Center, Poznan, Poland; Email: [email protected]

    Marian Bubak, AGH University of Science and Technology, Krakow, Poland, and Institute for Informatics, University of Amsterdam, The Netherlands; Email: [email protected]

    Franck Cappello, National Institute for Research in Computer Science and Control (INRIA), Rennes, France; Email: [email protected]

    Nicholson Collier, Argonne National Laboratory, Argonne, IL; Email: [email protected]

    Camille Coti, LIPN, CNRS-UMR7030, Université Paris 13, F-93430 Villetaneuse, France; Email: [email protected]

    Werner Dubitzky, School of Biomedical Sciences, University of Ulster, Coleraine BT52 1SA, UK; Email: [email protected]

    Åke Edlund, KTH Royal Institute of Technology, Stockholm, Sweden; Email: [email protected]


Chaker El Amrani, Université Abdelmalek Essaâdi, Tanger, Morocco; Email: [email protected]

    Florian Feldhaus, Dortmund University of Technology, Dortmund, Germany; Email: [email protected]

    José Fiestas, Astronomisches Rechen-Institut, Zentrum für Astronomie, University of Heidelberg, Germany; Email: [email protected]

    Stefan Freitag, Dortmund University of Technology, Dortmund, Germany; Email: [email protected]

    Wei Ge, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, China; Email: [email protected]

    Piotr Grabowski, Poznan Supercomputing and Networking Center, Poznan, Poland; Email: [email protected]

    László Gulyás, Aitia International Inc. and Collegium Budapest (Institute for Advanced Study), Budapest, Hungary; Email: [email protected]

    Tsuyoshi Hamada, Nagasaki Advanced Computing Center, Nagasaki University, Nagasaki, Japan; Email: [email protected]

    Jack J. Harris, Human Robot Interaction Laboratory, Indiana University, Bloomington, IN; Email: [email protected]

    Thomas Herault, National Institute for Research in Computer Science and Control (INRIA), Rennes, France; Email: [email protected]

    George Kampis, Collegium Budapest (Institute for Advanced Study), Budapest, Hungary; Email: [email protected]

    Krzysztof Kurowski, Poznan Supercomputing and Networking Center, Poznan, Poland; Email: [email protected]

    Erwin Laure, KTH Royal Institute of Technology, Stockholm, Sweden; Email: [email protected]

    Mariusz Mamonski, Poznan Supercomputing and Networking Center, Poznan, Poland; Email: [email protected]

    Johannes Mandel, Roche Diagnostics GmbH, Penzberg, Germany; Email: [email protected]

    Keigo Nitadori, RIKEN AICS Institute, Kobe, Japan; Email: [email protected]

    Michael North, Argonne National Laboratory, Argonne, IL; Email: [email protected]

    Mark Parsons, EPCC, The University of Edinburgh, Edinburgh, UK; Email: [email protected]

    Tomasz Piontek, Poznan Supercomputing and Networking Center, Poznan, Poland; Email: [email protected]

    Katarzyna Rycerz, AGH University of Science and Technology, Krakow, Poland, and ACC Cyfronet AGH, Krakow, Poland; Email: [email protected]

    Matthias Scheutz, Department of Computer Science, Tufts University, Medford, MA; Email: [email protected]

    Hsi-Yu Schive, Department of Physics, National Taiwan University, Taipei, Taiwan; Email: [email protected]


Rainer Spurzem, National Astronomical Observatories of China, Chinese Academy of Sciences, China; Astronomisches Rechen-Institut, Zentrum für Astronomie, University of Heidelberg, Germany; and Kavli Institute for Astronomy and Astrophysics, Peking University, China; Email: [email protected]

    Martin Swain, Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Penglais, Aberystwyth, Ceredigion, UK; Email: [email protected]

    Xiaowei Wang, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, China; Email: [email protected]

Chapter 1

    State-of-the-Art Technologies for Large-Scale Computing

    Florian Feldhaus and Stefan Freitag
    Dortmund University of Technology, Dortmund, Germany

    Chaker El Amrani
    Université Abdelmalek Essaâdi, Tanger, Morocco

    Large-Scale Computing, First Edition. Edited by Werner Dubitzky, Krzysztof Kurowski, Bernhard Schott. © 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc.

    1.1  INTRODUCTION

    Within the past few years, the number and complexity of computer-aided simulations in science and engineering have seen a considerable increase. This increase is not limited to academia as companies and businesses are adding modeling and simulation to their repertoire of tools and techniques. Computer-based simulations often require considerable computing and storage resources. Initial approaches to address the growing demand for computing power were realized with supercomputers in the 1960s. Around 1964, the CDC 6600 (a mainframe computer from Control Data Corporation) became available and offered a peak performance of approximately 3 × 10⁶ floating point operations per second (flops) (Thornton, 1965). In 2008, the IBM Roadrunner¹ system, which offers a peak performance of more than 10¹⁵ flops, was commissioned into service. This system was leading the TOP500 list of supercomputers² until November 2009.

    1 www.lanl.gov/roadrunner.
    2 www.top500.org.


Supercomputers are still utilized to execute complex simulations in a reasonable amount of time, but can no longer satisfy the fast-growing demand for computational resources in many areas. One reason why the number of available supercomputers does not scale in proportion to the demand is the high cost of acquisition (e.g., $133 million for Roadrunner) and maintenance.

As conventional computing hardware is becoming more powerful (processing power and storage capacity) and affordable, researchers and institutions that cannot afford supercomputers are increasingly harnessing computer clusters to address their computing needs. Even when a supercomputer is available, the operation of a local cluster is still attractive, as many workloads may be redirected to the local cluster and only jobs with special requirements that outstrip the local resources are scheduled to be executed on the supercomputer.

In addition to current demand, the acquisition of a cluster computer for processing or storage needs to factor in potential increases in future demands over the computer’s lifetime. As a result, a cluster typically operates below its maximum capacity for most of the time. E-shops (e.g., Amazon) are normally based on a computing infrastructure that is designed to cope with peak workloads that are rarely reached (e.g., at Christmas time).

    Resource providers in academia and commerce have started to offer access to their underutilized resources in an attempt to make better use of spare capacity. To enable this provision of free capacity to third parties, both kinds of provider require technologies to allow remote users restricted access to their local resources. Commonly employed technologies used to address this task are grid computing and cloud computing. The concept of grid computing originated from academic research in the 1990s (Foster et al., 2001). In a grid, multiple resources from different administrative domains are pooled in a shared infrastructure or computing environment. Cloud computing emerged from commercial providers and is focused on providing easy access to resources owned by a single provider (Vaquero et al., 2009).

Section 1.2 provides an overview of grid computing and the architecture of grid middleware currently in use. After discussing the advantages and drawbacks of grid computing, the concept of virtualization is briefly introduced. Virtualization is a key concept behind cloud computing, which is described in detail in Section 1.4. Section 1.5 discusses the future and emerging synthesis of grid and cloud computing before Section 1.7 summarizes this chapter and provides some concluding remarks.

    1.2  GRID COMPUTING

    Foster (2002) proposes three characteristics of a grid:

1. Delivery of nontrivial qualities of service
    2. Usage of standard, open, general-purpose protocols and interfaces
    3. Coordination of resources that are not subject to centralized control


Endeavors to implement solutions addressing the concept of grid computing culminated in the development of grid middleware. This development was and still is driven by communities with very high demands for computing power and storage capacity. In the following, the main grid middleware concepts are introduced and their implementation is illustrated on the basis of the gLite³ middleware, which is used by many high-energy physics research institutes (e.g., CERN). Other popular grid middleware include Advanced Resource Connector (ARC⁴), Globus Toolkit,⁵ National Research Grid Initiative (NAREGI⁶), and Platform LSF MultiCluster.⁷

    Virtual Organizations. A central concept of many grid infrastructures is a virtual organization. The notion of a virtual organization was first mentioned by Mowshowitz (1997) and was elaborated by Foster et al. (2001) as “a set of the individuals and/or institutions defined by resource sharing rules.”

The virtual organization concept is used to overcome the temporal and spatial limits of conventional organizations. The resources shared by a virtual organization are allowed to change dynamically: Each participating resource/institution is free to enter or leave the virtual organization at any point in time. One or more resource providers can build a grid infrastructure by using grid middleware to offer computing and storage resources to multiple virtual organizations. Resources at the same location (e.g., at an institute or computing center) form a (local) grid site. Each grid site offers its resources through grid middleware services to the grid. For the management and monitoring of the grid sites as well as the virtual organizations, central services are required. The main types of service of a grid middleware may be categorized as follows (Fig. 1.1) (Foster, 2005; Burke et al., 2009):

• Execution management
    • Data management
    • Information services
    • Security

Execution Management. The execution management services deal with monitoring and controlling compute tasks. Users submit their compute tasks together with a description of the task requirements to a central workload management system (WMS). The WMS schedules the tasks according to their requirements to free resources discovered by the information system. As there may be thousands of concurrent tasks to be scheduled by the WMS, sophisticated scheduling mechanisms are needed. The simulation and analysis of the

3 http://glite.cern.ch/.
    4 www.nordugrid.org/middleware.
    5 www.globus.org/toolkit.
    6 www.naregi.org/link/index_e.html.
    7 www.platform.com.



various protocols, for example, dcap, xrootd,¹³ gridFTP, and SRM (Badino et al., 2009). The LCG storage element defines the minimum set of protocols that have to be supported to access the storage services.

For gLite, the central LCG File Catalog enables virtual organizations to create a uniform name space for data and to hide the physical data location. This is achieved by using logical file names that are linked to one or more physical file names (consisting of the fully qualified name of the storage element and the absolute data path for the file on this specific storage element). Replicas of the data can be created by copying the physical files to multiple storage elements and by registering them under one unique logical file name in the LCG File Catalog. Thus, the risk of data loss can be reduced.
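The mapping from logical file names to replicas can be pictured as a small catalog data structure. The following Python sketch is purely illustrative (the class, method, and file names are hypothetical and not part of the LCG File Catalog API); it shows how one logical name may resolve to several physical replicas, so that losing one storage element does not mean losing the data.

    # Toy model of a replica catalog: one logical file name (LFN) maps to
    # several physical file names (PFNs) on different storage elements.
    # Illustrative only -- not the actual LCG File Catalog interface.
    class ReplicaCatalog:
        def __init__(self):
            self._replicas = {}  # LFN -> list of PFNs

        def register(self, lfn, pfn):
            """Register a physical replica under a logical file name."""
            self._replicas.setdefault(lfn, []).append(pfn)

        def resolve(self, lfn):
            """Return all known replicas; any one of them can serve the data."""
            return self._replicas.get(lfn, [])

    catalog = ReplicaCatalog()
    lfn = "/grid/myvo/run42/events.root"
    catalog.register(lfn, "srm://se1.example.org/data/run42/events.root")
    catalog.register(lfn, "srm://se2.example.org/backup/run42/events.root")
    print(catalog.resolve(lfn))  # either replica can be used if one SE is down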

    Information System. The information system discovers and monitors resources in the grid. Often the information system is organized hierarchically. The information about local resources is gathered by a service at each grid site and then sent to a central service. The central service keeps track of the status of all grid services and offers an interface to perform queries. The WMS can query the information to match resources to compute tasks.

For gLite, a system based on the Lightweight Directory Access Protocol (LDAP) is used. The Berkeley Database Information Index (BDII) service runs at every site (siteBDII) and queries all local gLite services in a given time interval. In the same way, the siteBDII is queried by a TopLevelBDII, which stores the information of multiple grid resources. The list of available resources is kept in the Information Supermarket and is updated periodically by the TopLevelBDII.
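Because the BDII is an ordinary LDAP server (conventionally listening on port 2170 with base DN o=grid), its contents can be inspected with any LDAP client. The sketch below uses the third-party Python ldap3 package; the host name is a placeholder, and the object class and attribute names follow the GLUE 1.x schema commonly published by gLite BDIIs, so the exact filter and attributes should be treated as assumptions to adapt to a concrete infrastructure.

    # Query a (site or top-level) BDII over LDAP for advertised compute elements.
    # Assumes a reachable BDII at bdii.example.org:2170 that allows anonymous
    # reads and publishes GLUE 1.x entries -- adjust names for your setup.
    from ldap3 import Server, Connection, ALL

    server = Server("bdii.example.org", port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # BDIIs typically allow anonymous binds

    conn.search(
        search_base="o=grid",                  # conventional BDII base DN
        search_filter="(objectClass=GlueCE)",  # compute-element entries (GLUE 1.x)
        attributes=["GlueCEUniqueID", "GlueCEStateFreeCPUs"],
    )
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateFreeCPUs)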

To account for the use of resources, a grid middleware may offer accounting services. These register the amount of computation or storage used by individual users, groups of users, or virtual organizations. This allows billing mechanisms to be implemented and also enables the execution management system to schedule resources according to use history or quota.

    The gLite Monitoring System Collector Server service gathers accounting data on the local resources and publishes these at a central service.

Security. To restrict access to resources to certain groups of users or virtual organizations, authentication and authorization rules need to be enforced. To facilitate this, each virtual organization issues X.509 certificates to its users (user certificate) and resources (host certificate). Using the certificates, users and resources can be authenticated.

To allow services to operate on behalf of a user, it is possible to create proxy certificates. These are created by the user and are signed with his user certificate. The proxy certificate usually has only a short lifetime (e.g., 24 hours) for security reasons. A service can then use the proxy certificate to authenticate against some other service on behalf of the user.

    Authorization is granted according to membership in a virtual organization, a virtual organization group, or to single users. Access rights are managed centrally by the virtual organization for its users and resources.

    13 http://xrootd.slac.stanford.edu/.



To manage authorization information, gLite offers the Virtual Organization Membership Service (VOMS), which stores information on roles and privileges of users within a virtual organization. With the information from VOMS, a grid service can determine the access rights of individual users.

    1.2.1  Drawbacks in Grid Computing

    A grid infrastructure simplifies the handling of distributed, heterogeneous resources as well as users and virtual organizations. But despite some effort in this direction, the use of inhomogeneous software stacks (e.g., operating system, compiler, libraries) on different resources has been a weakness of grid systems.

Virtualization technology (see Section 1.3) offers solutions for this problem through an abstraction layer above the physical resources. This allows users, for example, to submit customized virtual machines containing their individual software stack. As grid middleware is not designed with virtualization in mind, the adaptation and adoption of virtualization technology into grid technology is progressing slowly. In contrast to grid computing, cloud computing (Section 1.4) was developed based on virtualization technology. Approaches to combine cloud and grid computing are presented in Section 1.5.

    1.3  VIRTUALIZATION

    As described in Section 1.2, grid computing is concerned mainly with secure sharing of resources in dynamic computing environments. Within the past few years, virtualization emerged and soon became a key technology, giving a new meaning to the concept of resource sharing.

This section describes two different types of technology (resource and platform virtualization) and provides an overview of their respective benefits.

Network Virtualization. At the level of the network layer, virtualization is categorized as internal or external virtualization. External virtualization joins network partitions into a single virtual unit. Examples of this type of virtualization are virtual private networks and virtual local area networks (IEEE Computer Society, 2006). Internal virtualization offers network-like functionality to software containers on a resource. This type of virtualization is often used in combination with virtual machines.

    Storage Virtualization. In mass storage systems, the capacities of physical storage devices are often pooled into a single virtual storage device, which is accessible by a consumer via the virtualization layer. Figure 1.2 depicts the layered structure of a simple storage model.

For storage systems, virtualization is a means by which multiple physical storage devices are viewed as a single logical unit. Virtualization can be accomplished at the server, fabric, storage subsystem, or file system level, and in two ways: in-band and out-of-band virtualization (Tate, 2003). In this


    SaaS. This means that software is made available on demand to the end user by a distributed or decentralized cloud computing environment instead of a local computer. Depending on the available infrastructure and technology, one of two methods is applied to provide the software. The conventional method uses a terminal server and one or more clients. The software is installed at the server site. For access, users connect via the client to the server. The contemporary method to provide a SaaS is to deliver the service as part of a Web application; hence, access is normally through a Web browser.

Keeping the software at a central repository reduces the total overhead for maintenance (e.g., installation and updating); instead of several software installations, only one is maintained. The use of SaaS requires a reliable and fast link to the cloud provider. In some cases, a local installation is still preferable to improve performance.

One of the established SaaS providers is Google. Google Docs,¹⁴ for example, offers online word processing and spreadsheet software.

PaaS. PaaS environments (e.g., Google App Engine¹⁵) are used for the development and deployment of applications while avoiding the need to buy and manage the physical infrastructure. Compared to SaaS, PaaS provides facilities like database integration, application versioning, persistence, scalability, and security. Together, these services support the complete life cycle of developing and delivering Web applications and services available entirely through the Internet.

IaaS. IaaS defines the highest service level in privacy and reliability that resource providers currently offer their customers. In this case, the term infrastructure includes virtual and physical resources.

With respect to the interfaces offered by a cloud, a distinction is made between compute clouds and storage clouds. Compute cloud providers employ, for example, virtualization and by doing so offer easy access to remote computing power. In contrast to this, storage clouds focus on the persistent storage of data. Most compute clouds also offer interfaces to a storage facility, which can be used to upload virtual appliances.

To satisfy the increasing demand for dynamic resource provisioning, the number of cloud providers is increasing steadily. Established providers are Amazon EC2,¹⁶ FlexiScale,¹⁷ and ElasticHosts.¹⁸

    1.4.1  Drawbacks of Cloud Computing

    At the peak of its hype, cloud computing was the proclaimed successor of grid computing. Unfortunately, cloud computing suffers from the same problems

14 http://docs.google.com.
    15 http://code.google.com/intl/de/appengine/.
    16 http://aws.amazon.com/ec2/.
    17 www.flexiant.com/products/flexiscale/.
    18 www.elastichosts.com.


    as grid computing—the difference is that in the cloud paradigm, the problems are located closer to the hardware layer.

The resource broker in a grid matches the job requirements (e.g., operating system and applications) with available resources and selects the most adequate resource for job¹⁹ submission. In the context of cloud computing, the understanding of the term job must be revised to include virtual appliances. Nevertheless, a cloud customer is interested in deploying his job at the most adequate resource, so a matchmaking process is required. Without the existence of a cloud resource broker, the matchmaking is carried out manually.

The absence of a cloud resource broker, combined with the natural human tendency to prefer favorite service providers, often leads to a vendor lock-in.²⁰ The severity of the vendor lock-in problem is increased by proprietary and hence incompatible platform technologies at the cloud provider level. For customers with data-intensive applications, it is difficult and expensive to migrate to a different cloud provider.²¹ With the vendor lock-in, the customers strongly depend on the provided quality of service in a cloud (e.g., availability, reliability, and security). If a cloud suffers from poor availability, this implies poor availability of the customer’s services.

A problem already present in grid computing is the limited network bandwidth between the data location and the computing resource. To bypass this bottleneck, commercial cloud providers started to physically mail portable storage devices to and from customers. For truly massive volumes of data, this “crude” mode of transferring data is relatively fast and cost-effective.²²

    1.4.2  Cloud Interfaces

A few years ago, standardization became one of the key factors in grid computing. Middleware like gLite and UNICORE have adopted open standards such as the Open Grid Services Architecture (OGSA) Basic Execution Services (Foster et al., 2008) and the Job Submission Description Language (JSDL) (Anjomshoaa et al., 2005). For cloud computing, it is not apparent if providers are interested in creating a standardized API. Such an API would ease the migration of data and services among cloud providers and result in a more competitive market.

    Analyzing APIs of various cloud providers reveals that a common standard is unattainable in the foreseeable future. Many providers offer an API similar to the one of Amazon EC2 because of its high acceptance among customers.

At the moment, API development goes in two directions. The first direction is very similar to the developments for platform virtualization: The overlay

19 A grid job is a binary executable or command to be submitted and run in a remote resource (machine) called server or “gatekeeper.”
    20 A vendor lock-in makes a customer dependent on a vendor for products and services, unable to use another vendor without substantial switching costs.
    21 Data transfers within a cloud are usually free, but in- and outgoing traffic are charged.
    22 http://aws.amazon.com/importexport/.


API libvirt²³ was created, which supports various hypervisors (e.g., Xen, KVM, VMware, and VirtualBox). For cloud computing, there is the libcloud project²⁴; the libcloud library hides the inhomogeneous APIs of cloud providers (e.g., Slicehost, Rackspace, and Linode) from the user. The second direction tends toward a commonly accepted, standardized API for cloud providers. Such an API would make the development of libcloud unnecessary.
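As a sketch of the first direction, Apache libcloud wraps many provider APIs behind one driver-based interface. The snippet below shows the general pattern only; the provider constants exist in libcloud, but the credentials are placeholders and the exact driver constructor arguments vary by provider (some need a region, token, or host), so consult the libcloud documentation before relying on it.

    # One client-side API over several cloud providers via Apache libcloud.
    # Credentials are placeholders; driver constructor arguments differ per provider.
    from libcloud.compute.types import Provider
    from libcloud.compute.providers import get_driver

    def list_all_nodes(accounts):
        """accounts: list of (libcloud Provider constant, key, secret) tuples."""
        for provider, key, secret in accounts:
            driver = get_driver(provider)(key, secret)
            for node in driver.list_nodes():  # same call regardless of provider
                print(provider, node.name, node.state)

    list_all_nodes([
        (Provider.EC2, "ACCESS_KEY", "SECRET_KEY"),
        (Provider.RACKSPACE, "username", "api_key"),
    ])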

    The open cloud computing interface (OCCI) is one of the proposed API standards for cloud providers. It is targeted at providing a common interface for the management of resources hosted in IaaS clouds. OCCI allows resource aggregators to use a single common interface to multiple cloud resources and customers to interact with cloud resources in an ad hoc way (Edmonds et al., 2009).

A client encodes the required resources in a URL (see Fig. 1.5). Basic operations for resource modifications (create, retrieve, update, and delete) are mapped to the corresponding HTTP methods POST, GET, PUT, and DELETE (Fielding et al., 1999). For example, a POST request for the creation of a computing resource looks similar to

    POST /compute HTTP/1.1
    Host: one.grid.tu-dortmund.de
    Content-Length: 36
    Content-Type: application/x-www-form-urlencoded

    compute.cores=8&compute.memory=16384

Figure 1.5  Open Cloud Computing Interface API.

23 http://libvirt.org/index.html.
    24 http://libcloud.apache.org.


    Issuing this request triggers the creation of a virtual machine consisting of 8 cores and 16 GB of memory. Other attributes that can be requested are the CPU architecture (e.g., x86), a valid DNS name for the host, and the CPU speed in gigahertz.

The provision of storage (e.g., a virtual hard disk drive) via OCCI requires the specification of the storage size in gigabytes. Users are able to query the status (online, off-line, or “degraded”) of the virtual storage, to back it up, to resize it, or to create snapshots.
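Since OCCI is plain HTTP, the interaction above can be reproduced with any HTTP client. The following sketch uses the Python requests library against the host and attribute names taken from the example in the text; real OCCI deployments add authentication and may differ in rendering details, so this is only an approximation of the protocol exchange, not a reference client.

    # Recreate the OCCI example exchange over plain HTTP.
    # Host and attribute names come from the example above; real deployments
    # require authentication and may use a different OCCI rendering.
    import requests

    base = "http://one.grid.tu-dortmund.de"

    # Create a compute resource with 8 cores and 16 GB of memory (POST = create).
    resp = requests.post(
        base + "/compute",
        data={"compute.cores": 8, "compute.memory": 16384},  # form-urlencoded body
    )
    resp.raise_for_status()
    compute_url = resp.headers.get("Location", resp.url)  # URL of the new resource

    # Retrieve the resource (GET = retrieve), then remove it (DELETE = delete).
    print(requests.get(compute_url).text)
    requests.delete(compute_url)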

    1.5  GRID AND CLOUD: TWO COMPLEMENTARY TECHNOLOGIES

Both grid and cloud computing are state-of-the-art technologies for large-scale computing. Grid and cloud computing could be viewed as mutually complementary technologies. Grid middleware is not likely to be replaced by cloud middleware because a cloud typically encompasses resources of only a single provider, while a grid spans resources of multiple providers. Nevertheless, compared with a grid resource, a cloud resource is more flexible because it adapts dynamically to the user requirements. Not surprisingly, one of the first “scientific” tasks clouds were used for was the deployment of virtual appliances containing grid middleware services. In the first days of cloud computing, only public, pay-per-use clouds (e.g., Amazon EC2) were available. Therefore, this endeavor was carried out only (1) to show its feasibility and (2) for benchmarking the cloud capabilities of commercial providers. The added value of cloud computing was enough stimulus to develop open source computing and storage cloud solutions (e.g., OpenNebula,²⁵ Eucalyptus²⁶).

With grid and cloud computing at their disposal, national and international e-science initiatives (e.g., D-Grid²⁷) are currently reviewing their activities, which are mainly focused on grid computing. One aspect of D-Grid is to provide discipline-specific environments for collaboration and resource sharing to researchers. Independent of the discipline, a common core technology should be utilized. With the advent of cloud computing and its success in recent years, it is planned to be added to the already existing core technology, namely, grid computing.

In this context, two trends toward the interoperation of grids and clouds are emerging. The first is implied by the lessons learned so far from grid computing and refers to the creation of a grid of clouds. Similar to single computing and storage resources gathered in a grid, the computing and storage resources of a cloud will become part of a larger structure. As briefly shown in Section 1.4, additional value-added services like a cloud resource broker can be part of such a structure.

    27 www.d-grid.de.

    26 http://open.eucalyptus.com/.

    25 http://opennebula.org/.

    http://www.d-grid.dehttp://open.eucalyptus.com/http://opennebula.org/

  • modeling and Simulation of grid and cloud computing     13

    odically queries cloud resources, for example, health status, pricing informa-tion, and free capacities. This information is used by a cloud broker to find the most adequate resource to satisfy the user’s needs. The specification of a common cloud API standard would ease the task of creating such a cloud broker.
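As an illustration of the selection step, the following Python sketch picks the cheapest healthy resource from the records such an information system might deliver; the field names and the scoring rule are assumptions made for illustration, not part of any standard.

# Illustrative cloud broker selection, assuming the information system
# delivers records with health, free-capacity, and pricing fields.
def select_resource(resources, cores_needed):
    candidates = [
        r for r in resources
        if r["health"] == "ok" and r["free_cores"] >= cores_needed
    ]
    if not candidates:
        raise RuntimeError("no cloud resource can satisfy the request")
    # Pick the cheapest adequate resource; other policies are equally possible.
    return min(candidates, key=lambda r: r["price_per_core_hour"])

resources = [  # hypothetical data gathered by the periodic queries
    {"name": "cloud-a", "health": "ok", "free_cores": 64, "price_per_core_hour": 0.12},
    {"name": "cloud-b", "health": "degraded", "free_cores": 128, "price_per_core_hour": 0.08},
]
print(select_resource(resources, cores_needed=16)["name"])  # -> cloud-a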

The second trend started with a gap analysis, the result of which will be a set of components required for the seamless integration of cloud resources into existing grid infrastructures. Using D-Grid as an example of a national grid initiative, gaps in the fields of user management and authentication, information systems, and accounting have been identified. Usually, authentication in a grid is based on a security infrastructure using X.509 certificates and/or short-lived credentials. Some commercial cloud providers offer a similar interface accepting certificates, but often a username/password combination (e.g., EC2_ACCESS_KEY and EC2_SECRET_KEY for Amazon EC2) is employed for authentication.
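A client that has to work with both kinds of infrastructure therefore needs to choose a credential type at runtime. The Python sketch below illustrates one possible fallback scheme; the EC2 variable names are taken from the text, X509_USER_PROXY is the conventional grid proxy location, and the surrounding logic is an assumption, not any project's actual code.

import os

def select_credentials():
    # Prefer a grid-style X.509 proxy certificate if one is present
    # (X509_USER_PROXY is the conventional environment variable).
    proxy = os.environ.get("X509_USER_PROXY")
    if proxy and os.path.exists(proxy):
        return {"type": "x509", "proxy": proxy}
    # Otherwise fall back to an EC2-style access/secret key pair.
    access = os.environ.get("EC2_ACCESS_KEY")
    secret = os.environ.get("EC2_SECRET_KEY")
    if access and secret:
        return {"type": "keypair", "access_key": access, "secret_key": secret}
    raise RuntimeError("no usable grid or cloud credentials found")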

Concerning the information systems, each of the grid middlewares supported in D-Grid runs its own: the gLite middleware uses a combination of site-level and top-level BDII, the Globus Toolkit uses (Web-)MDS, and UNICORE 6 uses the Common Information Service (CIS). The information provided by these systems is collected and aggregated by D-MON, a D-Grid monitoring service. D-MON uses an adapter for each type of supported grid middleware that converts the data received into a common, middleware-independent data format for further processing. To integrate information provided by a cloud middleware, a new adapter and a definition of the measurement categories need to be developed.
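The Python sketch below illustrates the adapter idea in miniature; the field names of both the common format and the cloud middleware's native output are invented for illustration, since the text does not specify them.

# Illustrative adapter in the style described for D-MON: each adapter maps
# middleware-specific monitoring data onto one common format (fields assumed).
def common_record(site, service, state, free_slots):
    return {"site": site, "service": service, "state": state, "free_slots": free_slots}

def cloud_adapter(native):
    # 'native' mimics what a cloud middleware might report; the keys are hypothetical.
    return common_record(
        site=native["zone"],
        service="cloud",
        state="ok" if native["running"] else "down",
        free_slots=native["free_vm_slots"],
    )

record = cloud_adapter({"zone": "site-1", "running": True, "free_vm_slots": 42})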

After the creation of these missing components, grids and clouds are likely to coexist as parts of e-science infrastructures.

1.6  MODELING AND SIMULATION OF GRID AND CLOUD COMPUTING

Today, modeling and simulation of large-scale computing systems are considered a fruitful R&D area for algorithms and applications. With the increasing complexity of such environments, simulation technology has undergone great changes and now deals with multiplatform distributed simulation, combining performance and structure. A high-performance simulator is necessary to investigate various system configurations and to create different application scenarios before putting them into practice in real large-scale distributed computing systems. Consequently, simulation tools should enable researchers to study and evaluate newly developed methods and policies in a repeatable and controllable environment and to tune algorithm performance before deployment on operational systems. A grid or cloud simulator has to be able to model heterogeneous computational components; to create resources, a network topology, and users; to enable the implementation of various job scheduling algorithms; to manage service pricing; and to output statistical results.


Existing discrete-event simulation studies cover both small numbers of grid sites with only a few hundred hosts and large-scale use cases with thousands or even millions of hosts. Common simulation studies deal with job scheduling and data replication algorithms to achieve better performance and high availability of data.

Grid simulation tools that implement one or more of the above-mentioned functionalities include the following:

    1. OptorSim (Bell et al., 2002) is being developed as part of the EU DataGrid project. It mainly emphasizes grid replication strategies and optimization.

    2. SimGrid toolkit (Casanova, 2001), developed at the University of California at San Diego, is a C-based simulator for scheduling algorithms.

3. MicroGrid emulator (Song et al., 2000), developed at the University of California at San Diego, can be used for building grid infrastructures. It allows applications created using the Globus Toolkit to be executed in a virtual grid environment.

4. GangSim (Dumitrescu and Foster, 2005), developed at the University of Chicago, aims to study usage and scheduling policies in a multivirtual organization environment and is able to model real grids of considerable size.

    1.6.1  GridSim and CloudSim Toolkits

    GridSim (Sulistio et al., 2008) is a Java-based discrete-event grid simulation toolkit, developed at the University of Melbourne. It enables modeling and simulation of heterogeneous grid resources, users, and applications. Users can customize algorithms and workload.

At the lowest layer of the GridSim architecture operates the SimJava (Howell and McNab, 1998) discrete-event simulation engine. It provides GridSim with the core functionalities needed for higher-level simulation scenarios, such as queuing and processing of events, creation of system components, communication between components, and management of the simulation clock.
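These services are the core of any discrete-event engine. The following Python sketch shows the mechanism in miniature, a simulation clock plus a time-ordered event queue; SimJava itself is a Java library, so this is an illustration of the principle rather than of its API.

import heapq

class Simulator:
    # Minimal discrete-event core: a clock and a time-ordered event queue.
    def __init__(self):
        self.clock = 0.0
        self.queue = []  # heap of (event_time, sequence_number, handler)
        self.seq = 0     # tie-breaker keeping simultaneous events ordered

    def schedule(self, delay, handler):
        heapq.heappush(self.queue, (self.clock + delay, self.seq, handler))
        self.seq += 1

    def run(self):
        while self.queue:
            time, _, handler = heapq.heappop(self.queue)
            self.clock = time   # advance the simulation clock
            handler(self)       # process the event; it may schedule new ones

sim = Simulator()
sim.schedule(5.0, lambda s: s.schedule(2.0, lambda t: print("done at", t.clock)))
sim.run()  # prints "done at 7.0"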

The GridSim toolkit enables running reproducible scenarios that are not possible in a real grid infrastructure. It supports high-level software components for modeling multiple grid infrastructures and basic grid components such as resources, workload traces, and information services. GridSim enables the modeling of different resource characteristics and allocates incoming jobs in space-shared or time-shared mode. It is able to schedule compute- and data-intensive jobs; it allows easy implementation of different resource allocation algorithms; it supports reservation-based or auction mechanisms for resource allocation; it enables simulation of virtual organization scenarios; it is able to visualize tracing sequences of simulation execution; and it bases background network traffic characteristics on a probabilistic distribution (Buyya and Murshed, 2002).

CloudSim (Buyya et al., 2009) is a framework, also developed at the University of Melbourne, that enables modeling, simulation, and experimentation on cloud computing infrastructures. It is built on top of GridSim (Fig. 1.6).

CloudSim is a platform that can be used to model the data centers, service brokers, scheduling algorithms, and allocation policies of a large-scale cloud computing platform. It supports the creation of virtual machines on simulated nodes as well as cloudlets (the jobs of the simulation) and allows the assignment of cloudlets to appropriate virtual machines. It also enables the simulation of multiple data centers, allowing investigations on federation and related policies for the migration of virtual machines.
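As a rough illustration of the kind of allocation policy one might experiment with in such a simulator, the Python sketch below assigns cloudlets to the least-loaded virtual machine; CloudSim itself is Java based, and the names and data layout here are illustrative, not its API.

def assign_cloudlets(cloudlet_lengths, vms):
    # Greedy policy: give each cloudlet to the least-loaded virtual machine.
    load = {vm: 0.0 for vm in vms}  # queued work (e.g., million instructions) per VM
    plan = []
    for length in sorted(cloudlet_lengths, reverse=True):  # longest cloudlet first
        vm = min(load, key=load.get)
        load[vm] += length
        plan.append((length, vm))
    return plan

# Five cloudlets of different lengths distributed over three virtual machines.
print(assign_cloudlets([400, 100, 250, 300, 50], ["vm0", "vm1", "vm2"]))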

    1.7  SUMMARY AND OUTLOOK

The demand for computing and storage resources has been increasing steadily for many years. In the past, the static resources of multiple providers were subsumed into a grid. The concept of resource sharing within a grid and its drawbacks were described in Section 1.2. Section 1.3 introduced two different virtualization types: resource and platform virtualization. Platform virtualization in particular is one of the key enabling technologies for the concept of cloud computing. Cloud providers act as service providers and offer customers software (SaaS), development platforms (PaaS), and infrastructure (IaaS). Section 1.4 provided a brief overview of cloud computing and the three main service types (SaaS, PaaS, and IaaS).

    Section 1.5 provided an outlook on efforts attempting to integrate cloud and grid middleware. e-Science initiatives need to provide a uniform and simple interface to both types of resources to facilitate future developments.

Section 1.6 discussed the need to model and simulate grid and cloud environments and presented important grid and cloud simulation toolkits (e.g., the Java-based simulators GridSim and CloudSim).

Figure 1.6  CloudSim architecture layers. Iface, interface; VM, virtual machine.


    REFERENCES

L. Abadie, P. Badino, J. P. Baud, et al. Grid-enabled standards-based data management. In 24th IEEE Conference on Mass Storage Systems and Technologies, pp. 60–71, Los Alamitos, CA: IEEE Computer Society, 2007.

    C. Aiftimiei, M. Sgaravatto, L. Zangrando, et al. Design and implementation of the gLite CREAM job management service. Future Generation Computer Systems, 26(4):654–667, 2010.

    A. Anjomshoaa, F. Brisard, M. Drescher, et al. Job Submission Description Language (JSDL) Specification: Version 1.0, 2005. http://www.gridforum.org/documents/GFD.56.pdf.

P. Badino, O. Barring, J. P. Baud, et al. The storage resource manager interface specification version 2.2, 2009. http://sdm.lbl.gov/srm-wg/doc/SRM.v2.2.html.

    W. Bell, D. Cameron, L. Capozza, et al. Simulation of dynamic grid replication strategies in OptorSim. In M. Parashar, editor, Grid Computing, Volume 2536 of Lecture Notes in Computer Science, pp. 46–57, Berlin: Springer, 2002.

    N. Bhatia and J. Vetter. Virtual cluster management with Xen. In L. Bougé, M. Alexander, S. Childs, et al., editors, Euro-Par 2007 Workshops: Parallel Processing, volume 4854 of Lecture Notes in Computer Science, pp. 185–194, Berlin and Heidelberg: Springer-Verlag, 2008.

    F. Bunn, N. Simpson, R. Peglar, et al. Storage virtualization: SNIA technical tutorial, 2004. http://www.snia.org/sites/default/files/sniavirt.pdf.

    S. Burke, S. Campana, E. Lanciotti, et al. gLite 3.1 user guide: Version 1.2, 2009. https://edms.cern.ch/file/722398/1.2/gLite-3-UserGuide.html.

    R. Buyya and M. M. Murshed. GridSim: A toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. Concurrency and Computation: Practice and Experience, 14(13–15):1175–1220, 2002.

R. Buyya, R. Ranjan, and R. N. Calheiros. Modeling and simulation of scalable cloud computing environments and the CloudSim Toolkit: Challenges and opportunities. In Waleed W. Smari and John P. McIntire, editors, Proceedings of the 2009 International Conference on High Performance Computing and Simulation, pp. 1–11, Piscataway, NJ: IEEE, 2009.

    H. Casanova. Simgrid: A toolkit for the simulation of application scheduling. In R. Buyya, editor, CCGRID’01: Proc. of the 1st Int’l Symposium on Cluster Computing and the Grid, pp. 430–441, Los Alamitos, CA: IEEE Computer Society, 2001.

    C. L. Dumitrescu and I. Foster. GangSim: A simulator for grid scheduling studies. In CCGRID’05: Proc. of the 5th IEEE Int’l Symposium on Cluster Computing and the Grid, pp. 1151–1158, Washington, DC: IEEE Computer Society, 2005.

    A. Edmonds, S. Johnston, G. Mazzaferro, et al. Open Cloud Computing Interface Specification version 5, 2009. http://forge.ogf.org/sf/go/doc15731.

    R. Fielding, J. Gettys, J. Mogul, et al. RFC 2616: Hypertext transfer protocol—HTTP/1.1. Status: Standards Track, 1999. http://www.ietf.org/rfc/rfc2616.txt.

I. Foster. What is the grid?—A three point checklist. GRIDToday, 1(6):22–25, 2002.

I. Foster. Globus Toolkit version 4: Software for service-oriented systems. In Network and Parallel Computing, pp. 2–13, 2005.



    I. Foster, A. Grimshaw, P. Lane, et al. OGSA basic execution service: Version 1.0, 2008. http://www.ogf.org/documents/GFD.108.pdf.

    I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. Int’l Journal of High Performance Computing Applications, 15(3):200–222, 2001.

    F. Howell and R. McNab. SimJava: A discrete event simulation library for Java. In Int’l Conference on Web-Based Modeling and Simulation, pp. 51–56, 1998.

IEEE Computer Society. IEEE standard for local and metropolitan area networks: Virtual bridged local area networks, 2006. http://ieeexplore.ieee.org.

A. Mowshowitz. Virtual organization. Communications of the ACM, 40(9):30–37, 1997.

H. J. Song, X. Liu, D. Jakobsen, et al. The MicroGrid: A scientific tool for modeling computational grids. Scientific Programming, 8(3):127–141, 2000.

A. Sulistio, R. Buyya, U. Cibej, et al. A toolkit for modelling and simulating data grids: An extension to GridSim. Concurrency and Computation: Practice and Experience, 20(13):1591–1609, 2008.

    J. Tate. Virtualization in a SAN, 2003. http://www.redbooks.ibm.com/redpapers/pdfs/redp3633.pdf.

    J. E. Thornton. Parallel operation in the control data 6600. In AFIPS 1964 Fall Joint Computer Conference, pp. 33–41, Spartan Books, 1965.

    L. M. Vaquero, L. Rodero-Merino, J. Caceres, et al. A break in the clouds: Towards a cloud definition. ACM SIGCOMM Computer Communication Review, 39(1):50–55, 2009.


Chapter 2

The e-Infrastructure Ecosystem: Providing Local Support to Global Science

Erwin Laure and Åke Edlund

KTH Royal Institute of Technology, Stockholm, Sweden

    2.1  THE WORLDWIDE E-INFRASTRUCTURE LANDSCAPE

Modern science is increasingly dependent on information and communication technologies (ICTs): analyzing huge amounts of data (in the terabyte and petabyte range), running large-scale simulations requiring thousands of CPUs (in the teraflop and petaflop range), and sharing results between different research groups. This collaborative way of doing science has led to the creation of virtual organizations that combine researchers and resources (instruments, computing, and data) across traditional administrative and organizational domains (Foster et al., 2001). Advances in networking and distributed computing techniques have enabled the establishment of such virtual organizations, and more and more scientific disciplines are using this concept, which is also referred to as grid computing (Foster and Kesselman, 2003; Lamanna and Laure, 2008). The past years have shown the benefit of basing grid computing on a well-managed infrastructure that federates the network, storage, and computing resources of a large number of institutions and makes them available to different scientific communities via well-defined protocols and interfaces exposed by a software layer (grid middleware). This kind of federated infrastructure is referred to as an e-infrastructure. Europe is playing a leading role in building multinational, multidisciplinary e-infrastructures. Initially, these efforts were driven by academic proof-of-concept and test-bed projects, such as the European Data Grid project (Gagliardi et al., 2006), but they have since developed into large-scale, production e-infrastructures supporting numerous scientific communities. Leading these efforts is a small number of large-scale flagship projects, mostly cofunded by the European Commission, which take the collected results of predecessor projects forward into new areas. Among these flagship projects, the Enabling Grids for E-sciencE (EGEE) project unites thematic, national, and regional grid initiatives in order to provide an e-infrastructure available to all scientific research in Europe and beyond in support of the European Research Area (Laure and Jones, 2009). But EGEE is only one of many e-infrastructures that have been established all over the world. These include the U.S.-based Open Science Grid (OSG)1 and TeraGrid2 projects, the Japanese National Research Grid Initiative (NAREGI)3 project, and the Distributed European Infrastructure for Supercomputing Applications (DEISA)4 project; a number of projects that extend the EGEE infrastructure to new regions (see Table 2.1); as well as (multi)national efforts such as the U.K. National Grid Service (NGS),5 the Nordic DataGrid Facility (NDGF)6 in northern Europe,

TABLE 2.1  Regional e-Infrastructure Projects Connected to EGEE

Project       Web Site                     Countries Involved
BalticGrid    http://www.balticgrid.eu     Estonia, Latvia, Lithuania, Belarus, Poland, Switzerland, and Sweden
EELA          http://www.eu-eela.org       Argentina, Brazil, Chile, Cuba, Italy, Mexico, Peru, Portugal, Spain, and Venezuela
EUChinaGrid   http://www.euchinagrid.eu    China, Greece, Italy, Poland, and Taiwan
EUIndiaGrid   http://www.euindiagrid.eu    India, Italy, and United Kingdom
EUMedGrid     http://www.eumedgrid.eu      Algeria, Cyprus, Egypt, Greece, Israel, Italy, Jordan, Malta, Morocco, Palestine, Spain, Syria, Tunisia, Turkey, and United Kingdom
SEE-GRID      http://www.see-grid.eu       Albania, Bosnia and Herzegovina, Bulgaria, Croatia, FYR of Macedonia, Greece, Hungary, Moldova, Montenegro, Romania, Serbia, and Turkey

1 http://www.opensciencegrid.org.

2 http://www.teragrid.org.

3 http://www.naregi.org.

4 http://www.deisa.org.

5 http://www.ngs.ac.uk.

6 http://www.ndgf.org.



    and the German D-Grid.7 Together, these projects cover large parts of the world and a wide variety of hardware systems.

In all these efforts, providing support and bringing the technology as close as possible to the scientist are of utmost importance. Experience has shown that, when dealing with new technologies, users require local support rather than a relatively anonymous European-scale infrastructure. This local support has been implemented, for instance, by EGEE through regional operation centers (ROCs) and by DEISA, which assigns users home sites that provide user support and a base storage infrastructure. In a federated model, local support can be provided by national or regional e-infrastructures, as successfully demonstrated by the national and regional projects mentioned earlier. These projects federate with international projects like EGEE to form a rich infrastructure ecosystem under the motto "think globally, act locally." In the remainder of this chapter, we exemplify this strategy with the BalticGrid and EGEE projects, discuss the impact of clouds on the ecosystem, and provide an outlook on future developments.

    2.2  BALTICGRID: A REGIONAL E-INFRASTRUCTURE, LEVERAGING ON THE GLOBAL “MOTHERSHIP” EGEE

To establish a production-quality e-infrastructure in a greenfield region like the Baltic region, a dedicated project, the BalticGrid8 project, was established. Following the principle of "think globally, act locally," the aim of the project was to build a regional e-infrastructure that seamlessly integrates with the international e-infrastructure of EGEE. In the first phase of the BalticGrid project, starting in 2005, the necessary network and middleware infrastructure was rolled out and connected to EGEE. The main objective of BalticGrid's second phase, BalticGrid-II (2008–2010), was to further increase the impact, adoption, and reach of e-science infrastructures among scientists in the Baltic States and Belarus, as well as to further improve the support of services and users of grid infrastructures. As with its predecessor, BalticGrid-II maintained strong links with EGEE and its technologies, with gLite (Laure et al., 2006) as the underlying middleware of the project's infrastructure.

    2.2.1  The BalticGrid Infrastructure

The BalticGrid infrastructure has been in production since 2006 and has been used significantly by the regional scientific community. The infrastructure consists of 26 clusters in five countries, of which 18 are on the EGEE production infrastructure, with more than 3500 CPU cores and 230 terabytes of storage space.

7 http://www.d-grid.org.

8 http://www.balticgrid.eu.

One of the first challenges of the BalticGrid project was to establish a reliable network in Estonia, Latvia, and Lithuania as well as to ensure optimal network performance for the large file transfers and interactive traffic associated with grids. This was successfully achieved in the project's first year by exploiting the European GÉANT network infrastructure; Belarus was added in the second phase.

2.2.2  BalticGrid Applications: Providing Local Support to Global Science

The resulting BalticGrid infrastructure supports scientists from the region and helps them adopt modern computation and data storage systems, enabling them to gain the knowledge and experience needed to work in the European Research Area.

The main application areas within the BalticGrid are high-energy physics, materials science and quantum chemistry, a framework for engineering modeling tasks, bioinformatics and biomedical imaging, experimental astrophysical thermonuclear fusion (in the framework of the ITER project), linguistics, and operational modeling of the Baltic Sea ecosystem.

To support these areas, a number of applications, often leveraging earlier EGEE work, have been ported to the BalticGrid:

• ATOM: a set of computer programs for theoretical studies of atoms

• Complex Comparison of Protein Structures: an application that offers a method for the exploration of potential evolutionary relationships between the CATH protein domains and their characteristics

• Crystal06: a quantum chemistry package to model periodic systems

• Computational Fluid Dynamics (FEMTOOL): modeling of viscous incompressible free surface flows

• DALTON 2.0: a powerful molecular electronic structure program with extensive functions for the calculation of molecular properties at different levels of theory

• Density of Montreal: a molecular electronic structure program

• ElectroCap Stellar Rates of Electron Capture: a set of computer codes that produce nuclear physics input for core-collapse supernova simulations

• Foundation Engineering (Grill): global optimization of grillage-type foundations using genetic algorithms

• MATLAB: distributed computing server

• MOSES SMT Toolkit (with SRILM): a factored phrase-based beam search decoder for machine translation

• NWChem: a computational chemistry package

• Particle Technology (DEMMAT): particle flows, nanopowders, and material structure modeling on a microscopic level using the discrete element method

• Polarization and Angular Distributions in Electron-Impact Excitation of Atoms (PADEEA)

• Vilnius Parallel Shell Model Code: an implementation of the nuclear spherical shell model approach

    2.2.3  The Pilot Applications

To test, validate, and update the support model, a smaller set of pilot applications was chosen. These applications received special attention during the initial phase of BalticGrid-II and helped shape BalticGrid's robust and cost-efficient overall support model.

2.2.3.1  Particle Technology (DEMMAT)  The development of an appropriate theoretical framework as well as numerical research tools for the prediction of constitutive behavior with respect to microstructure is among the major problems of the computational sciences. In general, the macroscopic material behavior is predefined by the structure of grains of various sizes and shapes, or even by individual molecules or atoms. Their motion and interaction have to be taken into account to achieve a high degree of accuracy. The discrete element method is an attractive candidate for modeling granular flows, nanopowders, and other materials on a microscopic level. It belongs to the family of numerical methods and refers to a conceptual framework on the basis of which appropriate models, algorithms, and computational technologies are derived (Fig. 2.1).

The main disadvantages of the discrete element method, in comparison to the well-known continuum methods, are related to the computational capabilities needed to handle a large number of particles over a short simulated time interval. The small time step imposed by explicit time integration schemes means that a very large number of time increments must be performed. Grid and distributed computing technologies are a standard way to address such industrial-scale computing problems.
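To make the time step restriction concrete, the following Python sketch advances two one-dimensional particles with an explicit scheme and a linear, spring-like contact force; the force law and all parameter values are assumptions chosen only to illustrate why very many small increments are needed.

# Two 1-D particles of radius 0.5 with a linear contact force (illustrative values).
k, m, dt = 1.0e5, 1.0e-3, 1.0e-5   # contact stiffness, particle mass, time step
# Explicit integration is stable only for dt well below sqrt(m/k) ~ 1e-4 s here,
# which is why realistic simulations need a very large number of increments.
x = [0.0, 0.9]   # positions; the particles initially overlap by 0.1
v = [0.0, 0.0]   # velocities
for step in range(1000):
    overlap = 1.0 - (x[1] - x[0])            # positive while the particles touch
    f = k * overlap if overlap > 0 else 0.0  # repulsive contact force
    for i, sign in ((0, -1.0), (1, 1.0)):
        v[i] += sign * f / m * dt            # explicit velocity update
        x[i] += v[i] * dt                    # explicit position update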

Figure 2.1  Particle flow during hopper discharge. Particles are colored according to the resultant force.


Interdisciplinary cooperation and the development of new technologies are major factors driving the progress of discrete element method models and their countless applications (e.g., nanopowders, compacting, mixing, hopper discharge, and crack propagation in building constructions).

2.2.3.2  Materials Science, Quantum Chemistry  NWChem9 is a computational chemistry package that has been developed by the Molecular Sciences Software group of the Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory, USA.

    NWChem provides many methods to compute the properties of molecular and periodic systems using standard quantum mechanical descriptions of the electronic wave function or density. In addition, NWChem has the capability to perform classical molecular dynamics and free energy simulations. These approaches may be combined to perform mixed quantum mechanics and molecular mechanics simulations.

2.2.3.3  CORPLT: Corpus of Academic Lithuanian  The corpus (a large and structured set of texts) was designed as a specialized corpus for the study of academic Lithuanian, and it will be the first synchronic corpus of academic written Lithuanian in Lithuania. It will be a major resource of authentic language data for linguistic research on academic discourse, for interdisciplinary studies, lexicographical practice, and terminology studies in theory and practice. The compilation of the corpus will follow the most important criteria: methods, balance, representativeness, sampling, the TEI P5 Guidelines, and so on. The grid application will be used for testing algorithms for the automatic encoding, annotation, and search-analysis steps. Encoding covers the recognition of text parts (sections, titles, etc.) and the correction of text flow. Linguistic annotation consists of part-of-speech tagging, part-of-sentence tagging, and so on. The search-analysis part deals with the complexity level of the search and tries to distribute and handle the load on the corpus services effectively.

2.2.3.4  Complex Comparison of Protein Structures  The Complex Comparison of Protein Structures application supports the exploration of potential evolutionary relationships between the CATH protein domains and their characteristics and uses an approach called 3D graphs. This tool facilitates the detection of structural similarities as well as of possible fold mutations between proteins.

    The method employed by the Complex Comparison of Protein Structures tool consists of two stages:

• Stage 1: all-against-all comparison of CATH domains by the ESSM software

    • Stage 2: construction of fold space graphs on the basis of the output of the first stage

    9 http://www.nwchem-sw.org.



This method is used to study evolutionary aspects of protein structure and function. It is based on the assumption that protein structures, like sequences, have evolved by a stepwise process, each step involving a small change in the protein fold. The application ("gridified" within BalticGrid-II) accesses the Protein Data Bank to compare proteins individually and in parallel.
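A minimal Python sketch of the second stage might look as follows, assuming stage 1 delivers pairwise similarity scores between domains; the score scale, the threshold, and the domain identifiers are illustrative assumptions, not output of the actual ESSM software.

from itertools import combinations

def fold_space_graph(domains, score, threshold=0.8):
    # Connect two domains when their structural similarity exceeds the threshold.
    return [(a, b) for a, b in combinations(domains, 2) if score[(a, b)] >= threshold]

domains = ["domainA", "domainB", "domainC"]   # hypothetical CATH domain identifiers
score = {("domainA", "domainB"): 0.91,        # hypothetical stage-1 scores
         ("domainA", "domainC"): 0.42,
         ("domainB", "domainC"): 0.87}
print(fold_space_graph(domains, score))  # -> [('domainA', 'domainB'), ('domainB', 'domainC')]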

    2.2.4  BalticGrid’s Support Model

    By working with these pilot applications, a robust, scalable, and cost-efficient support model could be developed. Although usuall