



Epidemiology Experimentation and Simulation Management through Scientific Digital Libraries

Jonathan P. Leidig

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in

Computer Science and Applications

Edward A. Fox, Co-Chair
Madhav V. Marathe, Co-Chair

Pete H. Beckman
Keith R. Bisset

Ali R. Butt
Henning S. Mortveit

July 20, 2012
Blacksburg, Virginia

Keywords: Computational Epidemiology, Digital Libraries, Simulation and Experimentation Management, 5S Formalisms

Copyright 2012, Jonathan P. Leidig


Epidemiology Experimentation and Simulation Management through Scientific Digital Libraries

Jonathan P. Leidig

(ABSTRACT)

Advances in scientific data management, discovery, dissemination, and sharing are changing the manner in which scientific studies are being conducted and repurposed. Data-intensive scientific practices increasingly require data management related services not available in existing digital libraries. Complicating the issue are the diversity of functional requirements and content in scientific domains as well as scientists’ lack of expertise in information and library science.

Researchers that utilize simulation and experimentation systems need digital libraries to maintain datasets, input configurations, results, analyses, and related documents. A digital library may be integrated with simulation infrastructures to provide automated support for research components, e.g., simulation interfaces to models, data warehouses, simulation applications, computational resources, and storage systems. Managing and provisioning simulation content allows streamlined experimentation, collaboration, discovery, and content reuse within a simulation community. Formal definitions of this class of digital libraries provide a foundation for producing a software toolkit and the semi-automated generation of digital library instances.

We present a generic, component-based SIMulation-supporting Digital Library (SimDL) framework. The framework is formally described and provides a deployable set of domain-free services, schema-based domain knowledge representations, and extensible lower and higher level service abstractions. Services in SimDL are specialized for semi-structured simulation content and large-scale data-producing infrastructures, as exemplified in data storage, indexing, and retrieval service implementations. Contributions to the scientific community include previously unavailable simulation-specific services, e.g., incentivizing public contributions, semi-automated content curating, and memoizing simulation-generated data products. The practicality of SimDL is demonstrated through several case studies in computational epidemiology and network science as well as performance evaluations.


Acknowledgments

I wish to thank my co-advisors, Dr. Edward A. Fox and Dr. Madhav Marathe, for the constant dedication, direction, and support they have provided. They have given me a framework for discovering my academic interests and a means to completing my research. I reflect in amazement on the expertise, knowledge, external networks, and time you have shared with me. I thank you for all you have taught me regarding academia, research, teaching, writing, and life. I thank Drs. Keith Bisset, Ali Butt, and Henning Mortveit for demonstrating and instilling a passion for excellence in research and development. I thank Dr. Pete Beckman for hosting an exciting summer of research and the chance to appreciate life at ANL. Thank you all for devoting significant time towards my career and dissertation. I am grateful for our association and the opportunity to complete my doctoral studies, working with you.

I had the privilege of working for five years alongside two wonderful laboratories at Virginia Tech. I thank all of the students, staff, and faculty associated with the Network Dynamics and Simulation Science Lab (NDSSL) and the Digital Library Research Lab (DLRL). I appreciate all of the time spent collaborating, discovering, researching, and writing papers. I also thank the Department of Computer Science; DTRA R&D Grants 1-09-1-0017, 1-11-1-0016, and 1-11-D-0016-0001; NIH MIDAS 2U01GM070694-09 and 3U01FM070694-09S1; NSF PetaApps OCI-0904844, NetSE CNS-1011769, and SDCI OCI-1032677; and Swiss TPH for financial and travel support.

I thank my friends Gerrit, AJ, and Devin, along with close friends and academic colleagues too numerous to enumerate here, for forging a common path through life in Michigan and Virginia. It would be unfair to only list a subset of you all here, while I would need to write a book to fully thank you individually. I thank these companions, some for sharpening my faith, some for sharing my passions, and some simply for sharing themselves.

I thank all of my parents, siblings, and grandparents for the decades of love, instruction, sacrifices, and time you have devoted to me. Thank you for believing in my abilities and eventual success. You have imparted what it means to be an honorable professor, family member, servant, and witness. Thank you for growing and growing up with me.

Finally, I thank my wife, Karly, for bending her life and plans to mine, sharing my burdens, and joining in my accomplishments. You have been the rock that supported all of my efforts in life and academia. Thank you for your patience, understanding, and tireless work beside me. Ultimately, I attribute my accomplishments to my faith and daily walk with Jesus Christ.


Dedication

To my wife for her unfailing faith, hope, and love; to my parents for their direction and purpose; and to my siblings for their unwavering confidence and support.

This dissertation is mine only because you lifted me high enough to reach for it.


Contents

1 Introduction 1

1.1 Computational Epidemiology Simulation . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Digital Library Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Justification of Effort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4.1 How is it Different from a Database? . . . . . . . . . . . . . . . . . . . . 5

1.4.2 Why is a Digital Library Needed? . . . . . . . . . . . . . . . . . . . . . . 6

1.5 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.5.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.5.2 Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.5.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Related Work 10

2.1 Researchers, Users, and Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.1.1 Users, Researchers, and How to Understand Scientists . . . . . . . . . . . 11

2.1.2 Education . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2 Notable Related Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Institutional Digital Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.4 Scientific Data Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4.1 Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4.2 Scientific Content Repositories . . . . . . . . . . . . . . . . . . . . . . . 18


2.4.3 Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 Computational Epidemiology and Network Science 22

3.1 Integrated Modeling Environments . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.2 Models and Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.2.1 Public Health Disease Models . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2.2 Simdemics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.3 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.3.1 Case Study: CINET DL . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.3.2 Case Study: Open Malaria DL . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3.3 EpiSimdemics and Simfrastructure DL . . . . . . . . . . . . . . . . . . . 29

3.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.5 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.6 Experiments and Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.7 Need for SimDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.8 Simulation and Experiment Supporting Digital Libraries . . . . . . . . . . . . . . 34

4 Digital Library Foundations 37

4.1 Formal Definitions for Scientific Contexts . . . . . . . . . . . . . . . . . . . . . . 40

4.1.1 DL Management Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.1.2 DL Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.1.3 DL Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.1.4 Takeaway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.2 Background Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.3 Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.4 Societies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.5 Scenarios and Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5 Simulation Digital Library Framework 53


5.1 A Simulation Supporting Digital Library . . . . . . . . . . . . . . . . . . . . . . . 53

5.1.1 Computational Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.1.2 Scientific Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.1.3 Data Integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.1.4 Quality of Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.1.5 Integration and Framework Extensions . . . . . . . . . . . . . . . . . . . 57

5.2 The Software Architectural Overview . . . . . . . . . . . . . . . . . . . . . . . . 58

5.3 Service Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.4 Interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.4.1 Functionality, Services, Scenarios . . . . . . . . . . . . . . . . . . . . . . 64

5.4.2 Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.4.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

6 Simulation-Specific Services 67

6.1 Service Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6.2 Novel Simulation-Specific Services . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.3 Infrastructure Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.4 Curate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.5 Digital Object Collection Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.6 Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.7 Get DLID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6.8 Get Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

6.9 Incentivize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

6.10 Log and Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.11 Memoization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.12 Provenance Register and Resolve . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.13 Search in Scientific Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

6.14 Supporting Metadata Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6.14.1 Generating Metadata Description Sets . . . . . . . . . . . . . . . . . . . . 88


6.14.2 Extracting Metadata Values . . . . . . . . . . . . . . . . . . . . . . . . . 88

6.14.3 Forming Metadata Records . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6.14.4 Indexing Metadata Records . . . . . . . . . . . . . . . . . . . . . . . . . 92

6.14.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

7 Evaluation 96

7.1 Methodology and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.1.1 Formal Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.1.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.1.3 Interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.2 Framework Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

7.2.1 DLMS Completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

7.2.2 DLS Software Toolkit Effectiveness . . . . . . . . . . . . . . . . . . . . . 99

7.2.3 DLS Software Toolkit Efficiency (Service Performance) . . . . . . . . . . 99

7.2.4 Service Generality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

8 Conclusions and Future Work 104

8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

8.2 Hypothesis Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

8.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

8.3.1 Functional Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

8.3.2 Cloud, Grid, and Cluster Supporting DL Services . . . . . . . . . . . . . . 107


List of Figures

1.1 Information lifecycle for scientific content within SimDL. . . . . . . . . . . . . . 2

3.1 Overview of a model schema (left), input terms for a selected level of the schema (foreground), and input configuration file being generated (right). . . . . . 28

3.2 Digital object model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3 Structured workflow of content as produced by the research environments that SimDL supports. . . . . . 30

3.4 Broader context of services and system components in SimDL. . . . . . . . . . . . 36

4.1 Conceptual simulation digital library universe based on the DELOS Reference Model using 5S-encoded formal definitions. . . . . . 39

4.2 Metamodel of SimDL’s architectural design. . . . . . . . . . . . . . . . . . . . . . 41

4.3 Definitional dependencies for the minimal simulation-supporting digital library. . . 43

4.4 Representation of scientific document workflow. See [87] for additional information regarding the workflow. . . . . . 46

4.5 Organization of the minimal simulation-supporting services and communication. . 50

4.6 DL modeling and generation architecture guided by DL generation theory provided in [49]. . . . . . 51

5.1 Prototypical simulation-based research infrastructural environment. . . . . . . . . 54

5.2 Information workflow in conducting and archiving an experiment. . . . . . . . . . 56

5.3 SimDL internal data flow architecture. . . . . . . . . . . . . . . . . . . . . . . . . 59

5.4 SimDL component execution control flow. . . . . . . . . . . . . . . . . . . . . . . 59

5.5 Possible science supporting services. . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.6 SimDL service dependencies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64


6.1 Organization of the content workflow, communication interface for content access and service exposure, low-level services, layered high-level services, and user communities. . . . . . 68

6.2 5S graph of collaboration and related services. . . . . . . . . . . . . . . . . . . . . 70

6.3 Message passing in a blackboard JINI-space coordinated SimDL environment. . . 71

6.4 Direct invocation of SimDL services in a SimDL infrastructure. . . . . . . . . . . 71

6.5 Metadata description set term organizational structure. . . . . . . . . . . . . . . . 89

6.6 Metadata description set for input configuration files to the EpiSimdemics model. This XML representation is simply an alternative, concise translation from the RDF triplestore used to store the metadata records. Each of these lines of XML is stored in a triple, e.g., (‘id:0’, ‘dc:creator’, ‘user:leidig’). . . . . . 93


List of Tables

4.1 Informal service definitions [52]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.2 Formal service definitions developed from initial work proposed in [52]. . . . . . . 45

4.3 Standard low level formal symbols as described in [50]. . . . . . . . . . . . . . . . 45

4.4 Informal service definitions for simulation science. . . . . . . . . . . . . . . . . . 50

5.1 SimDL service and broker classes. . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.2 SimDL service and broker classes (continued). . . . . . . . . . . . . . . . . . . . 62

6.1 Blackboard-based DLBroker service. . . . . . . . . . . . . . . . . . . . . . . . . . 70

6.2 Formal service definitions for curating (deletion of content operation). . . . . . . . 73

6.3 DLCurate service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.4 DLCurateDORanker service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6.5 Formal service definitions for filtering. . . . . . . . . . . . . . . . . . . . . . . . . 78

6.6 DLFilter service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6.7 Formal service definitions for getting a DL identifier. . . . . . . . . . . . . . . . . 79

6.8 DLGetDLID service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

6.9 Formal service definitions for getting metadata. . . . . . . . . . . . . . . . . . . . 79

6.10 DLGetMetadata service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

6.11 Formal service definitions for incentivizing. . . . . . . . . . . . . . . . . . . . . . 80

6.12 DLIncentivize service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.13 Partial statistics in CINET incentivizing. . . . . . . . . . . . . . . . . . . . . . . . 81

6.14 Formal service definitions for logging. . . . . . . . . . . . . . . . . . . . . . . . . 82


6.15 DLLog and DLRegister service. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.16 Formal service definitions for memoization. . . . . . . . . . . . . . . . . . . . . . 83

6.17 DLMemoization service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.18 Formal service definitions for provenance tracking. . . . . . . . . . . . . . . . . . 84

6.19 DLProvRegister service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

6.20 DLProvResolve service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

6.21 Formal service definitions for searching. . . . . . . . . . . . . . . . . . . . . . . . 86

6.22 DLSearch service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

7.1 DLMS definition completeness. . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7.2 DLMS definitions and DLS toolkit comparison. . . . . . . . . . . . . . . . . . . . 100

7.3 Evaluation of the DLBroker in milliseconds. . . . . . . . . . . . . . . . . . . . . . 100

7.4 Evaluation of the DL services with time resolution in milliseconds (with small, medium, and large indexes). . . . . . 102

7.5 Service deployment efforts for the CINET case study. . . . . . . . . . . . . . . . . 103


Chapter 1

Introduction

In the broad scientific community, there exist numerous large collections of data-intensive experiments. In particular, the computing, biology, mechanical engineering, and physics domains tend to produce large datasets of computer-generated output from experimentation systems. Acquiring statistically significant results from a single set of input parameters may require thousands of experiment runs to be conducted. Each of these runs may take a matter of seconds to several weeks to complete. In addition, the equipment, computing systems, human efforts, and other resources make experiments expensive to run and especially wasteful to rerun. This cost makes it difficult and expensive for experiments to be replicated if data is not preserved. Even for collaborators working in similar areas, replicating experiments with the exact same inputs, machines, and software tools may be immensely difficult or prohibited. In many cases, the exact environment and input settings are not recorded in a manner that would allow for an exact duplication of an experiment over long-term time frames. In some groups, data might be kept organized only while actively used and analyzed by researchers. In many domains and organizations, data are also very infrequently shared externally or managed in a manner that allows for repurposing. Funding and intellectual property concerns may discourage sharing of information with potential collaborators. Thus, few systems are in place for collaborators willing to work cooperatively on similar problems.

A digital library (DL) that makes extensive use of stored metadata on the methodology of data generation will allow enhanced reuse of existing experiments as well as the ability to reference them later. Although digital libraries for the scientific community exist, these systems primarily focus on storing published papers interpreting data instead of archiving the data itself. In fact, there are numerous examples of archiving valued scientific content with less importance placed on storing the data that higher-level content is based upon [32, 86, 100]. It is currently difficult for scientific claims made in high-impact papers to be verified. The SIMulation Digital Library (SimDL) is a theoretical framework, service toolkit, and practical instantiation that allows information in the epidemiology and network science simulation domains to be stored, managed, reused, and curated throughout a structured lifecycle; see Fig. 1.1. This is accomplished through the customized digital library, which maintains information concerning contextual scenarios, underlying datasets, inputs, results, analyses, and related content for each experiment. [31] discusses the concept of an information ecology composed of an organization’s entire information environment. SimDL closely follows this concept by addressing the scientific workflows, user activities, information usage, data sources, information products, and information management priorities of simulation-based research groups in several domains. In addressing these concepts, a software instance deployment of SimDL provides the minimal functionality required to support a range of simulation-based scientific activities. In this work, I have produced a generic software toolkit that may be used to produce SimDL instances. This generic software may be utilized in multiple domains, e.g., epidemiology and network science, for data management services.
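One way a digital library of this kind can turn preserved inputs into saved compute is the memoization of simulation-generated data products mentioned in the abstract. The sketch below is illustrative only, assuming a canonical hash of the input configuration as the lookup key; the `MemoStore` class and `simulate` callback are hypothetical names, not part of SimDL's actual API.

```python
import hashlib
import json

class MemoStore:
    """Illustrative sketch: map a canonical hash of an input
    configuration to archived results, so a re-submitted experiment
    is served from the archive instead of being re-run."""

    def __init__(self):
        self._results = {}

    def _key(self, config: dict) -> str:
        # Canonical JSON (sorted keys) so logically equal configs
        # hash to the same key regardless of field order.
        blob = json.dumps(config, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def get_or_run(self, config: dict, simulate):
        key = self._key(config)
        if key not in self._results:      # cache miss: run and archive
            self._results[key] = simulate(config)
        return self._results[key]

if __name__ == "__main__":
    calls = []
    def simulate(cfg):
        calls.append(cfg)                 # stand-in for an expensive run
        return {"attack_rate": 0.42}      # hypothetical result record

    store = MemoStore()
    a = store.get_or_run({"r0": 1.4, "seed": 7}, simulate)
    b = store.get_or_run({"seed": 7, "r0": 1.4}, simulate)  # same config, reordered
    print(len(calls), a == b)             # the simulation ran only once
```

The design choice worth noting is the canonicalization step: without sorted-key serialization, two logically identical configurations submitted by different tools could miss the cache.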

Figure 1.1: Information lifecycle for scientific content within SimDL.

1.1 Computational Epidemiology Simulation

Epidemiology encompasses the study of health patterns in a population and factors that contribute to health. Computational epidemiology involves the development and utilization of computer models to understand the spatio-temporal diffusion of diseases through populations. Animal health simulations require models for avian flu, swine flu, and mad-cow diseases. Human health simulations include communicable diseases (e.g., influenza), vector-borne diseases (e.g., malaria), and immunopathology. Social network simulations include human manipulation, fads, marketing, and computer virus diffusions. These types of simulations are used to test public health policies, develop intervention strategies, and predict network diffusion end-states.

Organizations such as the Centers for Disease Control and Prevention (CDC), Centre for Research on the Epidemiology of Disasters (CRED), European Centre for Disease Prevention and Control, Health Protection Agency, World Health Organization, and public health organizations in numerous countries study, among other things, the health state of a population and factors contributing to changes in the health state. Computational approaches can be applied for these studies to simulate how a disease disperses throughout a human population. Studies are conducted, in general, through the use of a simulation software tool that requires information on the population structure, agent behavior, disease transmission, and a model of the disease. With this information, a simulation may study the characteristics of the disease spread. These studies may be generalized to the simulation of the diffusion process of a contagion across a given network. The generalized approach allows software tools to simulate animal health, human health, human manipulation, computer states, and adaptive agent behavior. These simulation software tools require specialized datasets on the population of agents as well as parameters to the scenario-specific models. After a simulation, the set of results produced by the software tool may include information on the state of the simulation at given time points as well as the end state of the scenario. Public health officials and agricultural administrators rely on simulations of a scenario considering potential interventions to set public policy. Network science researchers may utilize the same approach to model the parallel scenarios of computer virus diffusion, transportation flows, or social network marketing. The results provided by computational experiments are used to set these health policies, define response strategies, and raise awareness of issues. To serve this purpose, managed experiment collections are needed to provide researchers with a mechanism to collaborate for the purpose of developing comprehensive solutions to modeled historical and envisioned real-world scenarios.
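The generalized contagion-diffusion process described above can be pictured as a minimal discrete-time SIR (susceptible-infected-recovered) process over a contact network. This is an illustrative sketch under simplifying assumptions (per-edge transmission probability, recovery after one time step), not code from any simulation tool discussed here; `sir_diffusion` and its parameters are hypothetical names.

```python
import random

def sir_diffusion(edges, seeds, p_transmit=0.5, rng=None):
    """Simulate a discrete-time SIR contagion over an undirected network.

    Each infected node transmits to each susceptible neighbor with
    probability p_transmit, then recovers; returns the set of nodes
    ever infected and the per-step infected counts."""
    rng = rng or random.Random(0)
    neighbors = {}
    for u, v in edges:                    # build the undirected adjacency map
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    infected, recovered = set(seeds), set()
    history = [len(infected)]
    while infected:
        newly = set()
        for node in infected:
            for nb in neighbors.get(node, ()):
                if nb not in infected and nb not in recovered \
                        and rng.random() < p_transmit:
                    newly.add(nb)
        recovered |= infected             # everyone infected this step recovers
        infected = newly - recovered
        history.append(len(infected))
    return recovered, history             # final epidemic size and time series

if __name__ == "__main__":
    chain = [(i, i + 1) for i in range(5)]    # line network 0-1-2-3-4-5
    final, hist = sir_diffusion(chain, seeds=[0], p_transmit=1.0)
    print(sorted(final))                      # with p=1 the whole chain is reached
```

The same skeleton, with the contagion reinterpreted, covers the parallel scenarios the text lists: a computer virus, a fad, or a marketing message spreading over the same network abstraction.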

1.2 Problem Formulation

Recent collaborations between a set of computational epidemiology institutes have identified the need for digital libraries to support simulation infrastructures. Digital libraries may be leveraged to support data management, data provisioning, information satisfaction, user interface requests, user interactions, and computation submission functions for modeling and simulation systems. Simulations within the computational epidemiology domain allow public health officials to experiment and analyze potential policies for various contexts. This experimentation process makes use of a sequence of digital content types: simulation schemas, input parameter configurations, raw datasets, result summaries, analyses and plots, documentation, publications, and annotations. Content may be captured at each stage of the scientific process, forming stage-specific collections of digital objects. Different users are interested in generating and accessing content from specific collections in order to support specific user roles and tasks. Many standalone simulation applications and online repositories interact with users through a highly customized, static web or command-line interface for providing input parameters and returning results. Grid, cloud, and volunteer computational resources form the backend of computational epidemiology simulation and analysis applications. Digital libraries (DLs) may be incorporated into the simulation infrastructure to enable cohesive access to content and provide seamless functionality between system components. A DL front-end to a simulation system may automate: the process of conducting analyses; the generation of plots; organization and management of simulation-related metadata and experimental collections; dissemination of studies and findings; provision of simulation-related content; and a mechanism for collaboration between simulation researchers, software developers, study designers, analysts, and policy setters.
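The stage-specific collections described above form a derivation chain: each digital object is produced from an object at the previous stage, so provenance can be resolved by walking parent links. The sketch below is a hypothetical illustration of that chain, assuming a simple parent-pointer representation; the `DigitalObject` class and field names are illustrative, not SimDL's actual object model.

```python
from dataclasses import dataclass
from typing import Optional

# Stage names follow the content-type sequence listed in the text.
STAGES = ["schema", "input_configuration", "raw_dataset",
          "result_summary", "analysis", "publication"]

@dataclass
class DigitalObject:
    dlid: str                              # library-assigned identifier
    stage: str                             # one of STAGES
    parent: Optional["DigitalObject"] = None  # object this was derived from

    def provenance(self):
        """Return the derivation chain from the root object to this one."""
        chain, node = [], self
        while node is not None:
            chain.append((node.stage, node.dlid))
            node = node.parent
        return list(reversed(chain))

if __name__ == "__main__":
    schema = DigitalObject("dl:0", "schema")
    config = DigitalObject("dl:1", "input_configuration", parent=schema)
    dataset = DigitalObject("dl:2", "raw_dataset", parent=config)
    summary = DigitalObject("dl:3", "result_summary", parent=dataset)
    print(summary.provenance())
```

Resolving provenance this way is what lets a user of a stage-specific collection (say, result summaries) trace any object back to the schema and input configuration that produced it.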

Formal descriptions of a DL framework, software toolkit, services, and content enable a semi-automated process to generate and verify digital libraries tasked with supporting a simulation system. Descriptions of each DL element must be clear, as services (e.g., data management) are provided by a non-human, automated software infrastructure. By supporting the formal definition and rapid generation of simulation-supporting digital libraries, we envision the semi-automated development of digital library systems; increased collaboration within and between research institutions; and effective management, discovery, reuse, and curation of existing content. Though our SimDL case study systems are within the computational epidemiology domain, domain and contextual information are encapsulated inside schemas to allow extensions into other simulation fields or non-simulation experimentation processes. As such, we interchangeably term this class of digital libraries scientific, experiment-supporting, and simulation-supporting digital libraries.

1.3 Digital Library Overview

SimDL is defined as a digital library tasked with at least minimally supporting computational epidemiology. SimDL is a fully formalized digital library management system with a complete software toolkit implementation. Digital libraries have been concisely defined as ‘‘a system that provides a community of users with coherent access to a large, organized repository of information and knowledge’’ [100]. In this particular instance, the community of users encompasses numerous computational and social scientists interested in computational epidemiology or network science, for purposes ranging from the simulation of network dynamics to the refinement of public health policies.

Borgman et al. expanded the above definition: ‘‘digital libraries are a set of electronic resources and associated technical capabilities for creating, searching, and using information... The content of digital libraries includes data, metadata that describe representation, creator, owner, reproduction rights, and metadata that consist of links or relationships to other data or metadata, whether internal or external to the digital library’’ [16]. A vital aspect of SimDL is its ability to communicate with existing epidemiology simulation tools and infrastructures to create new content. Metadata is leveraged to link the provenance sequence of a scientific workflow, including input data, simulation results, network analyses, generated reports, multimedia files, and publications.
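The provenance-linking role of metadata described above can be sketched as a simple chained record structure. This is a hedged illustration only: the field names, stage labels, and dict representation are assumptions for the sketch, not SimDL's actual XML-based schemas.

```python
# Sketch: chain workflow artifacts so each metadata record points to its
# predecessor, forming the provenance sequence described in the text.
# All identifiers here are hypothetical examples.

def link_provenance(stages):
    """Build a list of records, each linked to the previous stage's record."""
    records = []
    previous_id = None
    for stage, artifact in stages:
        record = {
            "id": f"{stage}:{artifact}",
            "stage": stage,            # e.g., input, result, analysis, report
            "artifact": artifact,      # file or digital object identifier
            "derived_from": previous_id,
        }
        records.append(record)
        previous_id = record["id"]
    return records

chain = link_provenance([
    ("input", "config.xml"),
    ("result", "run_042.dat"),
    ("analysis", "attack_rate.png"),
    ("report", "study.pdf"),
])
```

Walking the `derived_from` links from the report back to the input configuration recovers the full workflow lineage of a study.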

Borgman et al. again refined the definition by adding: ‘‘digital libraries are constructed - collected and organized - by [and for] a community of users, and their functional capabilities support the information needs and uses of that community. They are a component of communities in which individuals and groups interact with each other, using data, information, and knowledge resources and systems. In this sense they are an extension, enhancement, and integration of a variety of information institutions as physical places where resources are selected, collected, organized, preserved, and accessed in support of a user community’’ [16]. Many communities would greatly benefit from having a digital library archive the information generated through experimentation. In the modeling and simulation communities, there are many context-specific applications that could make use of such a digital library. Simulation environments typically funded by the DOD, DOE, and NIH include transportation and epidemiology, along with numerous other discrete event applications [9]. A collaborative group of individuals in each of these separate areas could make use of an archive of all simulations conducted through the use of SimDL. Developing the library-focused aspects of a digital library facilitates a transdisciplinary approach, bringing collections, activities, research efforts, and researchers together throughout the scientific workflow process [103]. SimDL strives to support collaboration within a transdisciplinary group with varied interests relating to communal content.

1.4 Justification of Effort

Archives provide the ability to utilize epidemiology experiments at later points in time to extract or repurpose information from the existing base of simulations. Since large-scale climate, epidemiology, or transportation simulations can take days or weeks to conduct, such a digital library could prove very beneficial in reducing resource consumption and wait time. Storing computational epidemiology experimental information reduces duplication of previous experiments and yields faster turnaround times for existing data. Even within small research groups, SimDL may reduce duplication of experiments caused by a lack of effective communication. Collections of epidemiology simulations allow for comparison and validation of real-world data against states predicted by a model. This allows researchers to predict future real-world states and fine-tune epidemiological models. The digital library promotes managing and sharing of scientific content across user roles, along with dissemination to other researchers.

1.4.1 How is it Different from a Database?

SimDL provides high-level services over underlying collections of content. Databases are widely used to structure collections of data at a low level of complexity. Database management systems (DBMS) strive to maintain and utilize collections of data, while resource planning packages add service layers to a DBMS [124]. In a similar fashion, digital libraries make use of underlying databases and DBMSs that provide the ability to access stored content in order to provide an extra layer of services exceeding simple storage. Further development of an initial storage structure into a service-driven repository is required when needs are not met by a simple structure or DBMS. Digital libraries provide the ability for users to develop representative metadata indexes linked to data object collections. Metadata indexes enable search for documents without detailed knowledge of the entire collection and its structure. Composite digital objects are used to link content from multiple collections. At its lowest level, SimDL incorporates an underlying storage structure including databases and a DBMS (e.g., Oracle or MySQL). The two storage layers, content and metadata, provide the basis for low- and high-level services.
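The two-layer idea above can be illustrated with a minimal relational sketch: a content table holding raw objects and a separately indexed metadata table that supports search without knowledge of the content's structure. The table and column names are illustrative assumptions, not SimDL's actual schema; an in-memory SQLite database stands in for the production DBMS.

```python
import sqlite3

# Layer 1: raw content objects. Layer 2: an indexed metadata table that
# supports search without knowing the content collection's structure.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content  (obj_id TEXT PRIMARY KEY, blob BLOB);
    CREATE TABLE metadata (obj_id TEXT, key TEXT, value TEXT);
    CREATE INDEX idx_meta ON metadata (key, value);
""")
db.execute("INSERT INTO content VALUES ('run_042', x'00')")
db.executemany("INSERT INTO metadata VALUES (?, ?, ?)", [
    ("run_042", "disease", "influenza"),
    ("run_042", "region",  "VA"),
])

# Locate objects purely through the metadata index.
hits = db.execute(
    "SELECT obj_id FROM metadata WHERE key='disease' AND value='influenza'"
).fetchall()
```

Higher-level services then operate on the metadata layer and dereference into the content layer only when a user actually retrieves an object.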

SimDL is designed to extend existing simulation systems. It interacts with simulation software to launch new simulations and stores returned results. With the necessary simulation input and result data structures, the system is capable of automating the capture of simulation-based experimentation. The data format and structure of an experiment determine the storage required to capture these simulations in the digital library. In addition to simulation data, the semantics behind the experiment can be automatically extracted and included in metadata records. This metadata allows the association of experimental data with higher-level analyses, graphs, plots, and reports. As additional content is manually produced, digital objects can be linked to the relevant set of experiments. Search is enabled over the data by using domain knowledge to determine the similarity of a query to the existing range of experiments and to identify the set of potentially related digital objects. This approach requires customized database and filesystem storage for simulation content to reduce the storage footprint for large collections. To promote collaboration between researchers, contributors, study designers, end-users, and resource managers, memoization and incentivization reporting are provided in SimDL. In short, SimDL is not simply a retrievable collection of documents but a tool for bringing together collections, services, and researchers throughout the entire experimentation lifecycle of information creation, diffusion, usage, and preservation - a need suggested by [103].

1.4.2 Why is a Digital Library Needed?

Computational epidemiology requires a transdisciplinary group of researchers to solve problems. The approach generally requires systems modeling, simulation software construction, statistical analysis, and public health interpretation and validation. As experimentation is continuous, it is important to make all information on a given experiment available to the set of researchers working on various aspects of the scientific workflow. A digital library may be used to automate collection development, collection management, and distribution of scientific materials to collaborators. A simple storage structure would not meet these objectives, as a digital library is required to provide support for higher-level services and abstractions. SimDL meets these needs through its middleware services layer, which makes use of the underlying storage system and domain information to provide insightful connections between objects in the collection. Existing DL software is not capable of supporting large-scale, complex data collections and providing science-based services.


1.5 Hypothesis

1.5.1 Objectives

Current socio-technical system modeling environments leveraging High Performance Computing (HPC) are complex systems for specialized domains. These systems are primarily accessible to a set of highly specialized technical personnel fluent in both the domain and the usage of the software application. Widespread access to simulation and analysis content is enhanced by the inclusion of a digital library in the modeling infrastructure. The vision for SimDL is to extend an infrastructure's existing information procurement and delivery mechanisms to provide a wider set of researchers with access to HPC models and analytical tools. Similar to the effect of search engines on altering research and analysis patterns, SimDL makes access to HPC resources seamless through content organization and accessibility mechanisms.

The goal of this effort is to provide collaborative access to scientific content in a comprehensive and meaningful way. The proposed digital library solution should capture enough metadata to allow a set of experiments to be reproduced on a similar computing system. This information must contain the environmental conditions, software tools used, and input parameters for experiments. The results and analyses produced from a given set of experiments are linked with the experimental information. The preservation of the above information allows services to be built around thorough investigations of the computational epidemiology domain. The SimDL framework has a domain-free design and software implementation. It targets the provisioning of services that automate human-intensive tasks regarding scientific content. We designed, implemented, deployed, and tested the SimDL framework to meet the objectives of the simulation community, prove our thesis, and answer related research questions.
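The reproducibility metadata described above (environment, software tools, and input parameters, linked to results) can be sketched as a small serializable record. The field and value names here are assumptions for illustration, not SimDL's actual schema or any real simulator's name.

```python
import json
from dataclasses import dataclass, asdict, field

# Hedged sketch of a reproducibility record; "EpiSim", "hpc-a", and all
# field names are hypothetical placeholders.

@dataclass
class ExperimentRecord:
    experiment_id: str
    environment: dict              # e.g., cluster name, node count
    software: dict                 # tool names and versions used
    inputs: dict                   # simulation input parameters
    results: list = field(default_factory=list)  # linked result object ids

rec = ExperimentRecord(
    experiment_id="exp-001",
    environment={"cluster": "hpc-a", "nodes": 64},
    software={"simulator": "EpiSim", "version": "2.1"},
    inputs={"R0": 1.8, "population": "VA"},
    results=["run_042.dat"],
)
serialized = json.dumps(asdict(rec))  # archivable, replayable description
```

A record like this, preserved alongside the result data, is what would let a later researcher rerun the experiment on a similar computing system.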

1.5.2 Thesis

The hypotheses of this dissertation are:

• Simulation-related content, users, user tasks, DL services, and toolkit implementations can be defined in formal notation.

• The SimDL framework can provide efficient and effective automated services to support simulation-based content, users, tasks, research, and infrastructure, as specified in the formal definitions.

These hypotheses lead to research questions including:

• What constitutes scientific simulation content, communities, and anticipated tasks?


• What necessary services can SimDL provide that are absent or inefficient in existing DL software?

• How are the efficiency and effectiveness of SimDL services evaluated?

1.5.3 Research Contributions

The SimDL framework and the approach taken provide a foundation for formalizing minimal needs within a scientific context, minimal software toolkit implementations for supporting several SimDL instances, abstractions for adding new services, and infrastructure for servicing requests. Research groups in other domains could use this approach to define, design, and implement similar DL instances in other fields based on the SimDL specifications described in this dissertation.

This research makes the following contributions to DL research communities:

• digital libraries described better than before (e.g., 5S implementation details, scientific & simulation context, and minimal epidemiology and network science requirements);

• formal 5S definitions for a class of scientific digital libraries;

• automation for scientific services and SimDL instance construction;

• existing services & collections for building additional services;

• definitions and implementations of novel simulation-based digital library services;

• efficient indexes of structured simulation content;

• metadata and curation processing grammars;

• potential DL instance offloading to virtualized HPC resources; and

• foundation for a simulation-services registry.

For the network science and computational epidemiology community, this work contributes:

• a deployable SimDL generation framework to support simulation infrastructures and workflows;

• semi-automated metadata description set generation tools;

• fully automated metadata extraction and record generation tools;

• preservation, archival, and curation grammars and tools;

• partial data management and sharing plan fulfillment;


• multiple fully-fledged practical SimDL instances; and

• evaluation of the benefits and performance of SimDL for the scientific community.


Chapter 2

Related Work

Two decades of research in the digital library field have been productive. Several open source digital library packages have been produced; examples include DSpace [106], EPrints [139], Fedora [119], and Greenstone [59]. Each of these software packages has been utilized in hundreds of case studies across a variety of domains. In certain situations, such as full-text document repositories, the existing software performs admirably. However, these DL environments are neither suited to nor capable of providing DL functionality to support simulation science.

The digital library community is in the early stages of supporting the diverse contexts, efforts, workflows, activities, roles, and content found in the research community. The scientific data management community is broadly conducting research in data management, archiving, access, sharing, and curation. However, progress towards a well-defined simulation-supporting framework or open source digital library has been elusive. Producing SimDL required advances to existing digital library theory, approaches, and technologies. Historically, there have been few science-specific DL service implementations. Until now, there have been no available science-supporting or simulation-supporting digital libraries. In this work, we have produced SimDL, a generic, extendable, and formally-defined DL framework with the minimal set of simulation-supporting services.

In [70], Tony Hey describes an emerging paradigm of data-intensive science. Science has progressed to a data exploration age from previous empirical, theoretical, and computational paradigms. In recognition of this change, funding agencies such as NSF now require data management planning [73]. Simple reuse of available open source DL packages is not feasible given the growing scale of data collections and the accelerating rate of science. This is demonstrated by the petabytes of data produced by CERN's Large Hadron Collider, the Australian Square Kilometre Array of radio telescopes, and astronomy's Pan-STARRS array of celestial telescopes. New search algorithms are required that utilize technologies such as MapReduce for distributing metadata indexes and computations. Beyond an assortment of science-specific services, a knowledge-driven research infrastructure is called for. The end goal of these efforts is open science, or transparency in experimental methodologies, observations, and collections of data [12]. It is assumed that this will eventually lead to public availability and reusability of scientific data, public accessibility and transparency of scientific communication, and web-based tools to facilitate scientific collaboration. This work also claims that scientific research and data belong to and should be managed as a ‘bazaar’, following the village, cathedral, and bazaar analogy. Some progress has been made in data-intensive science, demonstrated in changing models of chemistry research revolving around data access and virtual collaborations. Historically, journals have recorded the ‘minutes of science’ through publication repositories [149]. This worldview has led to a system of dark data, or data that has been produced but is invisible and lost to science [67]. In data-supporting digital libraries, the goal is to record the methodology and content of scientific efforts, not simply records of ‘finished’ products.
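The MapReduce pattern mentioned above, applied to building an inverted metadata index, can be sketched in a single process. This is a toy illustration of the pattern only, with hypothetical document ids and metadata; a real deployment would shard the map and reduce phases across machines.

```python
from collections import defaultdict

# Map phase: emit (term, doc_id) pairs from each document's metadata.
def map_phase(doc_id, metadata):
    return [(value, doc_id) for value in metadata.values()]

# Reduce phase: group pairs by term into an inverted index.
def reduce_phase(pairs):
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return index

docs = {
    "run_1": {"disease": "influenza"},
    "run_2": {"disease": "measles"},
}
pairs = [p for d, m in docs.items() for p in map_phase(d, m)]
index = reduce_phase(pairs)
```

Distributing the map phase over partitions of the collection and the reduce phase over partitions of the term space is what makes this approach scale to petabyte-class archives.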

Digital library research started in the early 1990s at locations including the University of California, Harvard University, Indiana University, New York University, the University of Michigan, and the University of Virginia [58]. Some of this was based on efforts in earlier decades, e.g., the University of Michigan's Michigan Terminal System, the University of Colorado's TAXIR system, and the University of Michigan's MICRO Information Management System. This work deviates from this tradition of DL theoretical research and DL instance delivery. Historically, DLs were delivered through the following process: a) planning development; b) purchasing hardware resources; c) planning network infrastructure and hardware; d) planning storage systems; e) developing systems and UIs; f) managing information, metadata, and document formats; g) building collections; and h) integrating with other electronic services and functionality [35]. In this work, we follow an alternative path of formally defining aspects of a class of DLs, implementing a software toolkit from these definitions, and deploying case studies through the software. Our previous publications have presented background reviews of pertinent scientific data management systems, studies analyzing and assisting scientists, simulation environments, and formal reference models [87, 88, 117].

2.1 Researchers, Users, and Roles

2.1.1 Users, Researchers, and How to Understand Scientists

There have been several notable efforts and surveys to understand how to support science by investigating scientists, researchers, and research organizations. In [159], a survey of 400 digital library managers reported that 47.5% spent time creating metadata records; 40% developing policies; and 32% managing controlled vocabularies, thesauri, and taxonomies for a domain. 58% of digital librarians have major concerns with understanding scientific workflows, 58% with the quality of existing metadata structures and record formats, and 56% with how search functions can make use of metadata. Additional analysis identified significant concerns with controlled vocabularies, standardization, interoperability, extensibility, multilingualism, staffing, outreach, and training tools for metadata. A separate survey identified the principal costs for groups that utilize DLs, i.e., 0.40 of expenditures on commercial and produced content, 0.23 on equipment and infrastructure, 0.18 on DL personnel, and 0.07 on content creation systems [58]. Surveys into the patterns of information use in the life sciences have identified a large gap between the actual practices of scientists and the guidelines of funding agencies [112]. This study may be utilized to improve understanding of information use and exchange in the life sciences, including epidemiology. These efforts inform the higher-level evaluations of scientific digital libraries and support the development of a reusable SimDL.

There have been numerous case studies specific to understanding researchers in data-intensive projects. [15] models individuals as data authors, managers, scientists, users, and funding agencies and describes the key responsibilities and practices of each group. The MESSAGE project was a scientific effort covering pollution sensors; [37] summarizes a case study of the data-related aspects of that work, e.g., datasets must be considered from a diverse set of researcher perceptions based on user roles and biases. [42] describes the research practices at the University of Michigan, covering which types of data management activities researchers engage in, how researchers think they could improve their practices, and which DL services interest them. [47] identifies roles that library science liaisons may take on within an organization, i.e., analyzing dataset requirements, management planning, teaching practices to students, and setting preservation standards. [68] looks at the manner in which scientists and scholars are currently producing, storing, and disseminating digital objects; the drawbacks of these practices; and how libraries could better manage intellectual output. [77] takes a similar outlook on the dangers of scientists applying information management techniques in an ad-hoc manner instead of first drawing on the traditional experience of library information scientists in managing digital objects. FISHnet attempted to produce a freshwater information sharing network [135]; its researchers were studied to discover how they use and manage data as well as their concerns about data protection and copyright. [85] presents a case study of the receptivity of certain types of scientific researchers to library involvement in curation practices and the attributes that lead to receptivity. [101] argues that data repositories may have a larger impact than institutional repositories, which have not had the initially anticipated impact on changing research patterns across domains; a number of UK-based research data repository initiatives are highlighted. [102] presents experiences in understanding how user, i.e., researcher, participation in data curation efforts can be encouraged. [120] assessed current academic data management practices and discovered that researchers do not need physical storage capacity but rather assistance with funding agency requirements, grant proposal explanations, finding data-related services, and publication support. [84] presents challenges to librarian practitioners, including content appraisal and selection, retention, description, preservation, authenticity, and compliance. [134] presents a series of interviews across 30 scientific projects to discover issues relating to the provision and adoption of e-infrastructures. [151] presents a study into how soft support, instead of additional technical solutions, can be utilized to solve the ‘people-problem’ side of scientific data management; they found that researchers organize data in ad-hoc manners, store data without security and backup, are positive about sharing data but slow to do so, and equate backups with long-term preservation. [48] looks into how scientific researchers can be modeled and compared based on similar research activities. These case studies provide insight into the current gaps between the desire of researchers to preserve scientific content, suitable DL systems, and available training.


2.1.2 Education

Across multiple domains, there has been a systematic push to educate the next generation of scientists in data management and information sciences. Many of the current efforts have been funded by academia and organizations such as NSF and JISC. Mantra offers research data management training online at the Ph.D. level¹. DataTrain provides data management training modules for post-graduate courses in Archaeology and Social Anthropology at the University of Cambridge². DATUM provides research data management training for postgraduate students in the health studies discipline³. DMP-ESRC provides monitoring and planning for economic and social research in data-rich investments⁴. DMTpsych provides postgraduate training for research data management in the psychological sciences⁵. The Applied Quantitative Methods Network (AQMeN) aims to build capacity in the use of intermediate and advanced level quantitative methods amongst the social science community⁶. [115] describes areas in which information specialists can make contributions to research practices in Biology; a graduate curriculum for Biological Information Specialists is also presented. These efforts, and others, constitute a laudable push to shift perceptions and user expertise along with technological advancements.

2.2 Notable Related Projects

Previous work has detailed the notable scientific research data initiatives in Australia, France, the Netherlands, the UK, and Italy, and is briefly summarized here [10]. In Australia, PANDORA provides collaboration and coordination on collection policies as well as collaborative archiving. The National Library of Australia's Digital Objects Management system manages long-term preservation and access to digital content. There is also a national Australian Digital Resource Identifier scheme. In France, preservation technologies are developed by the Institut National de l'Audiovisuel. In the Netherlands, the national library, Koninklijke Bibliotheek, is conducting long-term preservation studies and has produced a digital deposit system for electronic publications (DNEP) with a goal of researching preservation. The Dutch Ministry of the Interior has produced a Digital Preservation Testbed. The Netherlands Institute for Scientific Information Services has developed archives tasked with curation and providing access to primary research data in biomedicine, the social sciences, history, and languages. In the UK, the British Library produced the Digital Library Store (DLS). The Joint Information Systems Committee (JISC) Digital Preservation Focus has funded digital preservation initiatives in academia and is responsible for many of the projects related to SimDL. The Data Archive for the Social Sciences has a history of storing research data back to the 1970s. The CAMiLEON research project aims to provide emulation and migration as preservation strategies. The Cedars research project provides persistent identifiers for data. In Italy, PRESTO from Radiotelevisione Italiana provides preservation for audiovisual media. Internationally, the Networked European Deposit Library has defined preservation metadata for digital publications. The Electronic Resource Preservation and Access NETwork (ERPANET) projects were efforts in the preservation of scientific digital objects in the early 2000s.

¹ http://datalib.edina.ac.uk/mantra
² http://www.lib.cam.ac.uk/preservation/datatrain
³ http://www.northumbria.ac.uk/sd/academic/ceis/re/isrc/themes/rmarea/datum
⁴ http://www.data-archive.ac.uk/create-manage/projects/JISC-DMP
⁵ http://www.dmtpsych.york.ac.uk/
⁶ http://aqmen.ac.uk

Efforts to enhance scientific data management practices have been a major focus for e-Science and cyberinfrastructure groups over the last several years. Major sponsors of research and projects in this area include NSF, ARL, and JISC. Several examples of scientific data management include earthquake simulation repositories [78], embedded sensor network DLs [18], community earth systems [38], D4Science II [23], mathematical-based retrieval [160], chemistry systems [96], national research data plans [82], and science portals [107]. Each of these projects attempts to provide e-Science software to support data management, archiving, access, and sharing.

Several recent projects demonstrate the variety and current state of the art in scientific data management, preservation, and dissemination. FISHlink was a JISC-funded project to demonstrate how diverse freshwater biology data can be combined, repurposed, and reused⁷. The project provided metadata attribution, provenance information, data sharing, and integration through vocabularies. A project in gravitational waves produced a data management and data sharing plan for big science, specifically astrophysics collaboration research groups [56]. The US National Virtual Observatory (NVO) is a large-scale system that provides access to data and computing resources [114], making it possible for astronomical researchers to locate, retrieve, and analyze data from archives, catalogs, and telescopes across the globe. The project had success in building a large community of research organizations and professional scientists in this domain. It is one of the few projects that has built efficient services for massive data sources. However, the software and infrastructure utilized here have not been generalized and reused in the broader scientific community. The NASA Planetary Data System archives and distributes NASA's scientific data from missions and observations⁸. This system has fully integrated datasets organized into nodes, subnodes, and data nodes distributed among dozens of organizations around the United States. Similarly, NASA's Earth Observing System project provides thousands of datasets to the general public⁹. The ESO Very Large Telescope and Science Archive System additionally performs routine computer-aided feature extraction from raw data and allows data mining on both data and extracted parameters¹⁰. Unfortunately, few other scientific domains maintain as much community support and governmental funding to develop substantive big science portals.

Several additional projects are worth noting as they each presented a small subset of the designdecisions also used in the SimDL framework. According to Greenberg, ‘‘data sought for repositorydeposition is being generated at a much faster rate than metadata can be created for representing,organizing, and accessing data’’ [57]. Their approach to this problem lead to the development

7 http://www.fishlinkonline.org
8 http://pds.jpl.nasa.gov
9 http://eospso.gsfc.nasa.gov

10 http://archive.eso.org/cms


Jonathan P. Leidig Chapter 2. Related Work 15

of Dryad. Dryad is a repository that automatically generates DOIs, file description metadata, and Dublin Core terms. However, user information metadata generation is semi-automated, and descriptions are manually acquired. This repository takes several approaches we reuse in SimDL, e.g., XML Schema metadata representations. However, Dryad does not provide the functionality required by the simulation community, e.g., fully automated metadata extraction. The National Geospatial Digital Archive produced guidelines for file formatting and archiving based on experiences in building large digital data archives from multiple data providers [40]. Specifically, they describe issues such as versioning, file formats, handling proprietary formats, dataset size, ownership, compression, generating collection-level metadata, generating digital object (DO)-level metadata, and format registration. The Data-PASS project aims to systematically identify, appraise, acquire, and archive social science datasets for preservation purposes [62]. They were able to demonstrate that a community is capable of archiving large amounts of content within a set of loose partnerships when a medium for collaborating and managing data is present. [146] describes the differences between document repositories and data repositories and demonstrates how data repositories are required to capture, in addition to data, information about the data, its creators, overviews, and various aspects related to proof (e.g., data validity). [144] describes the Australian National Data Service, a large-scale framework that supports the creation, capture, management, and sharing of research data. This is done through a collection registry, persistent identifiers, policies, schemas, metadata catalogues, harvesting guidelines, and search services. [17] details how ‘little science’ researchers, e.g., in habitat ecology, are beginning to use embedded sensor networks and DLs to capture, organize, and manage large numbers of DOs.
The effort describes how data practices evolve with data-intensive generation mechanisms. The work also discusses curation and dissemination policies, the lack of domain data standards, the lack of interest in utilizing other groups' data, and concerns about ownership and quality. [22] describes the DILIGENT approach to DL generation, where basic services are combined into application-specific DL instances. [21] further elaborates by describing the integration of DL and grid technologies and resources within the Earth Science community. DILIGENT is notable in that it allows communities to build specialized DLs that support the whole lifecycle of content production and use in user-defined workflows. Ongoing work in content-based image retrieval (CBIR) has produced algorithms for automatically extracting and indexing features found in multimedia, such as scientific data images. One CBIR project in fish species identification extended the 5S framework to provide a formal foundation for the work and to allow application-specific definitions of CBIR digital libraries; see [110, 111]. Our SimDL framework takes a similar approach and also builds upon facets of each of the other projects.

2.3 Institutional Digital Libraries

Digital library systems greatly differ in terms of their scope, functionality, and scalability. Digital libraries for individual units or projects are globally prevalent and domain-specific, e.g., a Geoscience department's DL for managing rocks, minerals, and core samples [141]. These DLs are heavily tailored for a domain, provide complete coverage for required services, and are generally


not reusable in other areas. Institutional digital libraries attempt to provide general services for a wider community while not being able to provide domain-specific services such as image-based search, e.g., Purdue University Libraries. These DLs are typically found in universities, where the goal is to provide a minimal level of research data management for faculty without the ability to support domain-specific metadata. In [156], the issues with institutional DLs are discussed, e.g., bottlenecks preventing community acceptance of a DL and lack of data sharing. The University of Edinburgh DataShare project provides a system of academic data sharing and management11. The Institutional Data Management Blueprint describes a DL for managing generic institutional data in higher education institutions [19]. [34] describes the University Digital Conservancy (UDC), the University of Minnesota's DL. The UDC is charged with data stewardship, archiving, educating scientists, building scientist portals, managing data through policies, and gaining TRAC repository certification. The developers of the UDC are attempting to support e-Science and data services through the university repository. It remains to be seen how these types of DLs are able to support the full diversity of scientific domains at a reasonable level of content granularity.

It is not entirely clear which human-intensive tasks institutional DLs must minimally support. [127] discusses the pressure on researchers to provide quality data management and sharing plans. It also discusses university reluctance to incentivize and support the generation of facilities for research data management and curation. Funding organizations wish to add value to expensive content, repurpose data across disciplines, produce transparency and reproducibility of published works, and release funded work in the public domain. However, the activities that lead to such a DL are still unclear for universities and individual researchers. [138] presents an extensive survey of approaches to e-Science and available services in both institutions and libraries. Multiple institutional DLs are then discussed in detail. One finding is that 9% of organizations were attempting to generate institute-wide e-Science DLs, 25% were focusing on supporting individual units only (e.g., departments), 61% were planning a hybrid structure of general institution-wide DLs with additional specialized unit DLs, and 49% were planning on collaborating with other institutes to produce a partnered DL that would be deployable at multiple institutions. With the SimDL framework, we produced a context-free unit digital library that could be used across domains (similar to an institutional DL) while allowing each context and infrastructure to directly dictate the domain-specific metadata schemas in use. The automation, or reduction, of institutional DL generation efforts by organizations without extensive resources or DL expertise is closely related to the Virtual Digital Library Generator [3]. The VDL Generator provides DL designers with a service to define a DL, produce a logical component plan, and yield deployment plans to be replaced with physical components. In comparison to the VDL Generator, SimDL concentrates on providing and integrating software components for formally defined individual services.
We use formalisms to precisely characterize the DL in order to highlight the emergence of simulation-supporting services in our framework and in comparable related works.

11 http://datashare.is.ed.ac.uk


2.4 Scientific Data Management

There are multiple domain-specific services and projects that were developed to support an aspect of scientific data management. Many groups have undertaken efforts to create, maintain, and utilize data management and sharing plans, particularly organizations associated with the Digital Curation Centre [36]. These efforts have produced a large number of tools for describing management plans but have not yet provided a consensus on quality tools in this area. The Data Management in Bio-Imaging project is similar to SimDL in that it attempts to support high-throughput bio-imaging systems [5]. The project had success in producing a ‘Bio-Imaging’ service that tracked images, software, software input parameters, and results, as well as linking the repositories of two scientific projects. While this effort parallels SimDL's support of simulation systems, the minimal SimDL toolkit requires an expanded set of services not available in that work. The History Archaeology Linguistics Onomastics and Genetics (HALOGEN) pilot project attempted to support the repurposing of research data between humanities and genetics researchers [140]. The project included support for a content lifecycle similar to SimDL's, albeit without the services required for simulation data. However, it provides a case study in joint data management plans and in enabling collaboration between researchers in different domains. The Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO) project aimed to integrate data management plan tools12. The produced toolkit integrates data management planning tools including DAF, AIDA, LIFE, and DRAMBORA. This is related to our work in integrating individual scientific services into a larger toolkit but concentrates on providing a user-selectable data management planning service through multiple service implementations.
The MaDAM and MiSS projects aimed to provide a university-wide data management service and infrastructure, initially integrated for Life and Medical Sciences researchers [121]. The service provides data capture, storage, and curation. The results of deploying the service for the Life and Medical Sciences led to inputs for applying the service to other domains, but progress has not yet been made in applying the service to the whole university or across domains. The Supporting Productive Queries for Research (SPQR) project utilized linked data, RDF triples, and ontologies to allow multiple types of content and media, from classical antiquities, to be queried as complex objects [76]. There is a direct parallel with our scientific workflow generation of a stream of related digital objects. That project attempted to link DOs with similar content, whereas SimDL links DOs based on provenance information from a structured workflow. The ARCHER project produced a tool for supporting sensors and scientific instruments that allows researchers to personally manage the collections of content they have generated and to share content through workspaces [2]. While this data management package allows for the management of experiments by individual researchers, it does not provide the set of minimal services required to support backend simulation infrastructures. Karasti describes the LTER program's approach to providing global curation services within communities willing to contribute to open data, e.g., long-term ecological research projects [80]. While these techniques, tools, and approaches have been produced, the DL community has not produced a generic toolkit for combining these services and approaches. Additionally, services that

12 http://www.dcc.ac.uk/projects/cardio


are appropriate for one domain are not necessarily generalizable to science as a whole. As such, a methodology for precisely defining services is required to inform service selection in DL-building efforts.
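The provenance-based linking of digital objects described above can be made concrete with a small sketch. The triples below follow the subject-predicate-object shape of RDF, and the identifiers and predicate names are hypothetical illustrations, not SimDL's actual schema:

```python
# Illustrative sketch (not SimDL's actual API): digital objects produced by
# one simulation workflow, linked by provenance triples in the
# (subject, predicate, object) style of RDF.
triples = [
    ("config:flu-42", "wasUsedBy", "run:flu-42"),
    ("run:flu-42", "generated", "result:flu-42"),
    ("result:flu-42", "wasInputTo", "analysis:flu-42"),
]

def derivation_chain(start, triples):
    """Follow provenance links forward from one digital object."""
    chain = [start]
    current = start
    found = True
    while found:
        found = False
        for subj, _pred, obj in triples:
            if subj == current:
                chain.append(obj)
                current = obj
                found = True
                break
    return chain

print(derivation_chain("config:flu-42", triples))
# ['config:flu-42', 'run:flu-42', 'result:flu-42', 'analysis:flu-42']
```

A structured workflow yields exactly this kind of chain from input configuration through run, result, and analysis, which is what distinguishes SimDL's provenance linking from content-similarity linking.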

2.4.1 Concerns

Progress in developing a class of science-supporting digital libraries has been hindered by a variety of concerns and requirements from diverse scientific communities. A DL that is acceptable for one community might not address the concerns of other groups. There are a host of external threats to digital collections [130]. These include media failure, hardware failure, software failure, network problems, media and hardware obsolescence, software obsolescence, human error, natural disaster, attack by hackers, lack of funding to continue maintenance, and demise of the hosting institution. Content publishers often have concerns over confidential records, privacy, copyright, restricted access, data corruption, and data validity [79]. Harvey presents the following range of additional problems with data management systems [65]. Technology obsolescence complicates preservation services, as computers and software are updated frequently, leading to the inability to access data. Technology fragility leads to inaccessible data when bytestreams are changed or corrupted. Curation is hindered by a general lack of understanding of what constitutes good practice, inadequate technologies, low organizational priority, the skill sets of scientists, and uncertainties about the best organizational infrastructures. Anderson presents an organization of a multitude of scientific data issues [1]. This organization describes domain-specific, organizational content management, content policy, and technical issues, and how these concerns form barriers to developing generalizable scientific data management systems. The direct and indirect costs of managing and preserving data are difficult to estimate and depend on long-term forecasting of staffing, equipment, acquisition and deposit interfaces, file formatting, processing, metadata indexes, access processing, refreshing, versioning, copying, migration, storing, corruption monitoring, intensity of utilization, media replacement, and disaster recovery [11].
It is often unclear how to set policies in large-scale data production environments for content selection criteria, purging, advisory panels, and scope of access, e.g., for remote-sensed data [41]. Different domains have a variety of specific challenges and concerns related to instrumentation, content generation mechanisms, infrastructures, and acceptance of community standards [118]. DLs mitigate many of these concerns. However, DLs must be placed within the larger socio-technological environment to identify settings in which DL integration faces aversion or is implausible.

2.4.2 Scientific Content Repositories

A limited number of DL solutions are available to provide repositories for data underlying scientific publications. Common DL software such as DSpace, Fedora, EPrints, and Greenstone can be configured to use domain-specific metadata schemas and provide a generic store for an arbitrary attached digital object. However, these types of deployments are not robust in satisfying research


processes and scientific infrastructures with domain-specific extensions and modules. Dryad provides some support for data and metadata via Dublin Core application profiles and DSpace [57]. Several institutional repositories have been utilized for sharing content in small science efforts across multiple domains [27]. Project StORe provided a middleware for resolving articles found in output repositories with original and derived data located in source repositories [123]. The Purdue Libraries' distributed institutional repository has been utilized in investigations of the relationships between raw data, structured data, publications, and scientific communities [156]. Universities in Australia have had some success in building repositories for raw data and collaborative research activities, e.g., ARROW, DART, and ARCHER [145]. This work found that implementing multiple repositories is the best-suited approach for separately supporting collaboration on research and publications. In the future of e-Science, multiple flavors of DLs and repositories are going to be needed to manage datasets, software, and publications based on the requirements of individual domains. Warner proposes an interoperability framework to join systems such as SimDL with aDORe, arXiv, DSpace, and Fedora [152]. From these examples, we can surmise that scientific communities generally recognize the need for DLs of scientific content. It is unlikely that a single DL will be suitable for the broad range of potential scientific DL instances. The scope of many of these DL efforts is more restricted than our goals in producing a DL for context-free, simulation-based research.

2.4.3 Standards

While current efforts in the DL community do not indicate the emergence of a standard institutional DL in the near future, the set of technologies and standards underlying scientific DL implementations is growing. Dahl provides a summary of web and metadata standards [29]. Sample web standards for automation include MARC to transfer records; XML markup languages to structure and communicate data; LDAP, Shibboleth, and Athens to authenticate users; OpenURL to transfer bibliographic information; and Z39.50, SRU/W, OpenSearch, and OAI for search and retrieval protocols. Metadata standards include XML-style syntax; the Library of Congress' MARCXML framework for expressing full MARC records in XML; the Metadata Object Description Schema (MODS) for a simplified version of MARC expressed in XML; the Dublin Core and VRA Core sets of standard metadata elements; Encoded Archival Description (EAD) for encoding archival types; and the Metadata Encoding and Transmission Standard (METS), an XML standard for combining descriptive, structural, and administrative metadata within a single record. Rhyno details licensing, discovery services, and underlying constructs in DLs [126]. Licensing options include the GNU General Public License (GPL), Creative Commons, the GNU Lesser General Public License (LGPL), the Artistic License, the Berkeley Software Distribution (BSD) license, the Apache Software License, the MIT License, the NCSA License, the Mozilla Public License (MPL), the Netscape Public License (NPL), and the OCLC Research Public License. Among metadata schemas, the Global Information Locator Service (GILS) describes a standard for organizations to describe collections. The Content Standard for Digital Geospatial Metadata (CSDGM) provides terms and definitions for the documentation of digital geospatial data. Learning Object Metadata (LOM) from the IEEE Learning Technology Standards Committee is used for education objects. The Online Information Exchange (ONIX)


defines publisher-generated metadata similar to the MARC standard. The Gateway to Educational Materials (GEM) is based on Dublin Core and adds controlled vocabularies for education, including subjects, resource type, pedagogy, audience, and format. XMLFile from Virginia Tech is a tool that exposes a set of XML files in a directory as an Open Archives Initiative (OAI) data provider. XML authoring tools include Amaya, Pollo, Xerlin, KOffice, OpenOffice Writer, Cooktop, and Jedit. Popular underlying databases include MySQL, PostgreSQL, SAP DB, InterBase/Firebird, HSQL, Berkeley DB, DB2, and Oracle. Object and XML stores include eXist, 4Suite, and Ozone. These technologies have been utilized in many DLs and provide general-purpose mechanisms for storing, indexing, discovering, and sharing content.

Miller describes current metadata standards and practices [105]. Metadata is often described along administrative, structural, and descriptive lines, where administrative metadata is composed of technical, preservation, rights, and usage terms. Common metadata services include identifying, retrieving, searching, browsing, metadata sharing, metadata harvesting, metadata aggregating (e.g., mapping and crosswalks between multiple schemas), metadata conversion and processing, and assessing quality. Metadata takes on several levels of granularity, including collection-level metadata in MARC as well as domain-specific, item-level terms. Efforts in defining a metadata description set may reuse widely accepted, standard terms found in the Dublin Core Metadata Initiative, the Library of Congress Subject Headings (LCSH), Medical Subject Headings (MeSH), and the Dewey Decimal Classification (DDC). Metadata schemas can be analyzed based on accepted definitions of about-ness, of-ness, is-ness, facets, exhaustivity, specificity, and ambiguity of possible values. Ambiguity is often tackled through synonym rings, authority files, lists, taxonomies and classification, thesauri, subject headings, and vocabularies. Controlled provenance vocabularies defined by DCMI include isVersionOf, hasVersion, isReplacedBy, replaces, isRequiredBy, requires, isPartOf, hasPart, isReferencedBy, references, isFormatOf, hasFormat, and conformsTo. Alternatively, MODS defines provenance vocabularies for preceding, succeeding, original, host, constituent, series, otherVersion, otherFormat, isReferencedBy, references, and reviewOf. Additionally, VRA 4.0 defines relatedTo, partOf, largerContextFor, formerlyPartOf, formerlyLargerContextFor, componentOf, componentIs, partnerInSetWith, preparatoryFor, basedOn, studyFor, studyIs, and many more relations.
To improve metadata quality and interoperability, the authors suggest organizations ‘‘1) use DC and other standard elements correctly, 2) include sufficient contextual information and access points, 3) enter data values that are machine-processable and linkable, 4) distinguish administrative and technical metadata from descriptive metadata, 5) document your local practices.’’ Developing a metadata schema requires considering the schema's context, defining functional requirements, developing element sets, identifying general terms and collection-specific terms, establishing controlled vocabularies and encoding schemes, developing content guidelines, documenting in a scheme or application profile, and creating a metadata registry or namespace. Widely used metadata practices include the use of linked data and RDF triplestores. The work also notes the emerging differences among linked data, databases, digital collections, DLs, digital archives, digital repositories, institutional repositories, and content management systems.
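As a concrete illustration of the DCMI terms listed above, a minimal Dublin Core record for a derived dataset might be built as follows. The identifiers and element values are hypothetical; only the element names (dc:title, dcterms:isVersionOf, and so on) come from the standard term sets discussed here, and this is a sketch rather than a SimDL schema:

```python
# Sketch: build a minimal Dublin Core record in XML using only the standard
# library. Titles and identifiers are invented examples; the element names
# are drawn from the DC and DCMI Terms namespaces.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", DCTERMS)

record = ET.Element("record")
ET.SubElement(record, f"{{{DC}}}title").text = "Smallpox scenario results, run 42"
ET.SubElement(record, f"{{{DC}}}creator").text = "Example Simulation Group"
# Provenance relations from the DCMI controlled vocabulary:
ET.SubElement(record, f"{{{DCTERMS}}}isVersionOf").text = "dataset:run-41"
ET.SubElement(record, f"{{{DCTERMS}}}references").text = "config:scenario-7"

print(ET.tostring(record, encoding="unicode"))
```

Serialized this way, the provenance terms let a harvester reconstruct how one digital object relates to its predecessors without domain-specific knowledge.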


2.5 Summary

These projects and SimDL provide services for scientific content generation, curation, representation, and management. However, SimDL focuses on structured simulation processes. SimDL addresses providing communication interfaces for external components, request submission, analysis submission, and data management for simulation systems. Additional workflows are integrated with SimDL through added DL components that make use of staged content (e.g., additional analysis processes or data mining may make use of existing datasets). Parallel e-Science DL efforts aim to bring together interoperable e-Infrastructures to generate worldwide virtual research environments. The focus of SimDL is to provide DL services related to computational technologies for individual researchers, institutions, and collaborators, not the semantic integration of resources. The scope includes automated DL generation and simulation-related content production and management.


Chapter 3

Computational Epidemiology and Network Science

Modeling and simulation studies involve a statistical design, where a simulation model covering cells, or scenarios (i.e., input parameter configurations), of the design is executed. Conducting experiments consists of producing a model, developing software, generating a set of configurations, running the simulation application with a configuration on HPC resources, acquiring results, conducting analyses, and human interpretation of the results. Similarities exist in computational epidemiology simulation workflows that can be supported and partially automated through a DL. My efforts in the application of digital library technologies to the field of computational epidemiology were presented to the public health community through multiple conference presentations [89, 90, 91, 93, 94, 95].
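The cell structure of such a design can be sketched concretely: a full factorial design over a few input parameters enumerates one scenario (cell) per combination of parameter levels. The parameter names and values below are hypothetical examples, not drawn from any particular study:

```python
# Sketch: enumerate the cells (scenarios) of a full factorial experiment
# design. Parameter names and value levels are invented for illustration.
from itertools import product

design = {
    "transmissibility": [0.3, 0.5, 0.7],
    "vaccine_coverage": [0.0, 0.25, 0.5],
    "school_closure":   [False, True],
}

params = list(design)
scenarios = [dict(zip(params, values)) for values in product(*design.values())]

print(len(scenarios))   # 3 * 3 * 2 = 18 cells
print(scenarios[0])     # the first input configuration
```

Each resulting dictionary corresponds to one input configuration submitted to the simulation application, which is exactly the granularity at which a DL can capture and track scenario metadata.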

3.1 Integrated Modeling Environments

As published in [8], modeling and simulation research is conducted in integrated technological environments. This type of environment is ideal for the utilization of a simulation-supporting digital library that supports multiple models, software, data, and researcher roles. The work published in that article provided the impetus to develop SimDL. See [8] for the full description of these environments.

3.2 Models and Software

SimDL aims to support a diverse set of simulation applications. The initial case studies that guided the development of SimDL were primarily composed of computational epidemiology models and network science analysis packages.



3.2.1 Public Health Disease Models

Typical epidemiological simulation models simulate the spread of diseases throughout human, animal, and plant populations. These simulations are used to predict the effects of potential strategies for mitigating the spread of an infectious disease and preventing epidemics. Differential equation-based models simulate a macro-level view of epidemics. Agent-based modeling is used to provide detailed results at the granularity of individuals in the population. Three classes of computational epidemiology models initially supported by SimDL are contagious diseases, vector organism-based diseases, and bacterial infections within human tissues. The distinguishing features of these models stem from the type of disease transmission and propagation, e.g., person-to-person, vector-borne, and within-host.
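The macro-level, differential equation-based approach can be illustrated with the classic textbook SIR model (a generic example, not one of the platforms cited below). The parameter values are purely illustrative:

```python
# Sketch of a macro-level, differential equation-based epidemic model: the
# classic SIR system dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I,
# dR/dt = gamma*I, integrated with a simple Euler step. Parameter values
# are illustrative, not drawn from any cited study.
def sir(beta, gamma, s, i, r, days, dt=0.1):
    n = s + i + r
    for _ in range(int(days / dt)):
        new_inf = beta * s * i / n * dt   # newly infected this step
        new_rec = gamma * i * dt          # newly recovered this step
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return s, i, r

s, i, r = sir(beta=0.4, gamma=0.2, s=9990.0, i=10.0, r=0.0, days=120)
print(round(r))  # final recovered count: most of the population when R0 = 2
```

Such aggregate models are cheap to run but cannot resolve individual behavior, which is what motivates the agent-based alternatives described next.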

Contagious Disease Models

The NIH Models of Infectious Disease Agent Study (MIDAS) research groups have produced a wide-ranging collection of infectious disease models. Several of these models use the EpiSimdemics [9] and EpiFast [14] simulation platforms. These two platforms have been used in multiple studies covering diseases that include H5N1, H1N1, smallpox, and pertussis (whooping cough). These diseases are spread by social contacts and environmental surfaces. The simulation models make use of synthetic populations of the United States based on census data, physical locations, and human activity modeling. Conducting a study using these platforms involves generating a population, implementing a disease model in software, developing an input configuration of required parameters, and executing the simulation software on large high-performance computing resources (e.g., clouds, grids, and clusters). The results of such studies are used by NIH and CDC policy makers to set mitigation policies during an epidemic (e.g., vaccine and antiviral distributions, school closures, and quarantines) [97]. Similar models consider the distribution of healthcare workers within given locations and the spread of diseases throughout the area [28].

Vector-Borne Disease Models

A parallel corpus of disease models, specific to malaria, has been produced with support from the Bill and Melinda Gates Foundation and many others. These vector-borne disease models include OpenMalaria [137]. This model also uses agent-based modeling to simulate the movement of individuals and their interactions with infectious vectors (i.e., mosquitoes). Studies using malaria models are typically conducted for the benefit of developing countries at high risk of malaria. Public health officials in developing countries directly benefit from web portals by reviewing public content produced in simulation systems. The findings of computational epidemiology researchers are captured through the digital library and may be used to guide pharmaceutical and healthcare delivery in developing and rural areas, similar to efforts in [39, 60, 122]. It is important to note that these disease models and simulations are based on observations, surveys, and datasets gathered in the field by local health officials. The number of disease models produced, along with versioning of


individual models, has led to the need within this group for generic interfaces, submission systems, and data management components.

Immunopathology Models

Within a human host, models can be used to predict the inflammation and regulatory immune pathways as individual cells interact with foreign bacteria. The ENteric Immunity SImulator (ENISI) models immunopathology (e.g., ulcers, inflammation, and diarrhea) considering microbes, mucosal immune responses, and multiple cell types in tissues such as the gut [154]. This application can be used by mucosal immunologists to test and generate hypothesized mechanisms for clinical enteric disease outcomes given in vitro (laboratory) observations. Populations are represented by groups of individuals with the same cell type and tissue site location. Each cell type (e.g., macrophage, dendritic cell, CD4+ T helper cell) that occupies a specific immunological state is represented as a population whose individuals move among tissue sites. Each cell changes its location depending on type-specific rules. When in the same location, cells are considered in contact and may change immunological state depending on a specific set of rules for interaction. Individual movement and interactions are specified, the simulation is run, and high-level behaviors are observed. Experiments are conducted by modifying specifications, to represent specific laboratory conditions, and observing the effect on the net immune response. An example study using this tool reproduced a typical inflammatory response to B. hyodysenteriae as well as the immunopathological effects of autoimmunity to commensal bacteria. Analyses of simulations have highlighted key pathways by which chronic inflammation persists following B. hyodysenteriae elimination and epithelial cell death.
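The mechanism just described, namely cells moving among tissue sites by type-specific rules and changing immunological state on contact, can be sketched minimally as follows. The movement and contact rules here are hypothetical placeholders, not ENISI's actual rule sets:

```python
# Minimal sketch of the agent-based mechanism described above (hypothetical
# rules, not ENISI's implementation): cells occupy tissue sites, move by
# type-specific rules, and co-located cells may change immunological state.
class Cell:
    def __init__(self, cell_type, state, site):
        self.cell_type, self.state, self.site = cell_type, state, site

def move(cell):
    # Hypothetical type-specific movement rule: macrophages migrate to the
    # epithelium; other cells remain where they are.
    if cell.cell_type == "macrophage":
        cell.site = "epithelium"

def interact(cells):
    # Hypothetical contact rule: a resting macrophage co-located with a
    # bacterium switches to an inflammatory state.
    for a in cells:
        if a.cell_type == "macrophage" and a.state == "resting":
            if any(b.cell_type == "bacterium" and b.site == a.site for b in cells):
                a.state = "inflammatory"

cells = [Cell("macrophage", "resting", "lamina_propria"),
         Cell("bacterium", "commensal", "epithelium")]
for _ in range(3):  # a few simulated time steps
    for c in cells:
        move(c)
    interact(cells)

print([(c.cell_type, c.state, c.site) for c in cells])
# the macrophage ends in the epithelium in an inflammatory state
```

Scaling this pattern to populations of cells per type, state, and site, with calibrated rule sets, is what lets ENISI-style models reproduce high-level outcomes such as persistent inflammation.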

3.2.2 Simdemics

My research involving SimDL was initially motivated by the need to support the Simdemics modeling and simulation platform. These applications include EpiSimdemics [9], EpiFast [14], and EpiNet [25]. As published in my earlier work [7], the Simdemics class of modeling applications allows for a large body of scientific content to be produced after substantial efforts in developing multi-agent computational epidemiology models. In addition, a large number of network science algorithms and measures have been added to the research infrastructure that hosts Simdemics. This class of network science packages led to the production of a SimDL case study specific to network science.

3.3 Case Studies

The research has been practically demonstrated in case studies. Epidemiology studies are conducted by comparing the results of simulations constructed with slightly varying input parameters, i.e., scenarios. SimDL services were designed and implemented to support the workflows of content produced by studies conducted within the NDSSL [6] and the Swiss Tropical and Public Health Institute

Page 37: Epidemiology Experimentation and Simulation Management ...Epidemiology Experimentation and Simulation Management through Scientific Digital Libraries Jonathan P. Leidig (ABSTRACT)

Jonathan P. Leidig Chapter 3. Computational Epidemiology and Network Science 25

[137]. The architectural components in each of these digital library implementations includecollections of diverse content, metadata record collections describing individual digital objects, aset of middleware services, metadata collection APIs, broker APIs, brokered connections to userinterfaces, and brokered connections to other infrastructure components. These case studies arerepresentative samples of simulation workflows similar to those seen in wet-laboratory managementsystems and high-throughput bioinformatics systems.

Development and extension of SimDL services was initially an effort between two computational epidemiology research groups at Virginia Tech, USA, and the University of Basel, Switzerland. Each group required SimDL to be integrated in different types of environments, namely a cyberinfrastructure for multiple simulation models as well as a simulation infrastructure local to an individual institution. The diversity of disease models, available computational systems, and user requirements led to the development of a brokering system for DL communication. Several decades of combined experience across the two groups identified the minimal set of required services for a scientific digital library. The resulting SimDL has been integrated with varying levels of coupling to brokers and other infrastructure components. SimDL has proven useful in forming collections from simulation workflows in both environments. These collections have formed repositories of content useful for infrastructure components such as user interface presentation and graph analysis platforms. SimDL has since been integrated in case studies including CINET and OpenMalaria.

3.3.1 Case Study: CINET DL

One case study SimDL instance supports a cyberinfrastructure for network science research and computer science education (CINET). Researchers in many domains deal with large network graphs, e.g., in bioinformatics, transportation, public health, and the social sciences. Educators teach graphs and networks in many domains. In computer science, the core curriculum includes several lecture hours on graphs and trees in the area of discrete structures. Many other departments offer graph theory courses along with other courses that deal with large network graphs. In addition, many graduate network science courses use the same types of content for research and education. These types of courses require real-world, large-scale graphs, analysis algorithms, and computational resources. We have developed a cyberinfrastructure for network science education and research. CINET manages large network graphs, datasets, distributed network analysis algorithms (e.g., GALIB, NetworkX, and GDSCalc), and coursework materials. Educators can use the CI as a medium for conducting class projects, storing lecture materials, discussing best practices with other educators, and using pre-existing graphs. CINET also has a research and experimentation UI and portal that allows users to use CINET computing resources to conduct experimentation using scalable analysis software on existing network graphs. The cyberinfrastructure may be used as laboratory material in order to use real-world, large-scale datasets in teaching and researching these topics.
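
The network measures that a CINET-style service computes and caches for a submitted graph can be sketched in a few lines. This is a pure-Python stand-in for calls into packages such as GALIB or NetworkX; the toy graph and the choice of measures are illustrative.

```python
# A sketch of the kind of network measures a CINET-style analysis service
# computes for a submitted graph. The graph and measures are illustrative.

def density(adj):
    """Edge density of an undirected simple graph given as adjacency sets."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge counted twice
    return 0.0 if n < 2 else 2.0 * m / (n * (n - 1))

def degree_sequence(adj):
    """Degrees of all vertices, largest first."""
    return sorted((len(nbrs) for nbrs in adj.values()), reverse=True)

# A toy 4-node path graph: 0-1-2-3
graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}

print(density(graph))          # 3 edges of 6 possible = 0.5
print(degree_sequence(graph))  # [2, 2, 1, 1]
```

In the real infrastructure such computations are dispatched to distributed analysis software and the results are archived by the DL for reuse in later course projects and experiments.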


DL Description

A SimDL instance was needed for information management, workflow support, and providing a variety of services. The cyberinfrastructure (CI) required services to manage network graphs, analysis algorithms, results, and logs of user behavior. APIs and brokers were needed to support connection to user interfaces and a blackboard system, i.e., Simfrastructure. Through the DL services, the CINET system is able to receive graph and software contributions from users, compute network analyses, reuse content, and incentivize the network science community. The aims for this community were to work collaboratively on generating a repository of datasets, conducting research, and guiding educational activities.

3.3.2 Case Study: OpenMalaria DL

Modeling and simulation research groups in the public health domain often duplicate efforts in building highly similar simulation infrastructures. Study designers and policy makers using these systems infrequently have proper technical training in building properly formatted simulation input configurations, connecting to computational systems, moving datasets to appropriate systems, and invoking simulation models through command line interfaces. A generic job submission system may support these research groups by automating many of the technical, routine tasks required to conduct computational studies and allowing new models to be plugged in to an existing infrastructure. The job submission system provides a generic user interface portal with connections to simulation engines, analysis processes, computational resources, and data management components. Schemas that describe simulation models' input configurations are used by the interface to present simulation models to users and structure the design of experimental simulation studies. The submission system may be used to design studies, submit studies to the simulation system, gather results, execute analysis scripts, archive datasets, and present the results and analyses back to the user through the interface. Using simulation model schemas has allowed the submission system to support multiple malaria and infectious disease simulation models while minimizing the human-intensive effort required to connect new simulation models.

A generic Job Submission System (JSS) has been designed and implemented to provide the type of infrastructure desired by the malaria-focused public health community. The primary model initially supported by the JSS and digital library instance is described in [137]. The infrastructure includes generic components for exposing disease models, providing a user interface, building input configurations, designing studies, requesting and executing simulations, gathering results, launching analyses and visualization applications, and managing data. The user interface makes use of a collection of schemas that describe individual simulation models to allow study designers to interact with disease-specific models. Data management is provided through a digital library that provides services to collect, index, manage, archive, curate, and search multiple collections of content as it is produced by the simulation infrastructure. The types of content staged include model schemas, input configurations, sub-configurations, etc. The required basic services include storage, management, access, browse, and retrieval. The collections are formed by capturing files as they are produced by the simulation-based scientific workflow.
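
The schema-driven checking described above can be sketched with a small validator. The schema format and the parameter names below are hypothetical; the real systems use XML Schema documents to describe each model's inputs.

```python
# A sketch of schema-driven input checking. The schema layout and parameter
# names (popSize, eirAnnual, simulationYears) are hypothetical examples.

SCHEMA = {  # parameter -> (type, min, max)
    "popSize":         (int,   100, 10_000_000),
    "eirAnnual":       (float, 0.0, 1000.0),
    "simulationYears": (int,   1,   100),
}

def validate(config, schema=SCHEMA):
    """Return a list of error strings; an empty list means the config is acceptable."""
    errors = []
    for name, (typ, lo, hi) in schema.items():
        if name not in config:
            errors.append(f"missing parameter: {name}")
            continue
        value = config[name]
        if not isinstance(value, typ):
            errors.append(f"{name}: expected {typ.__name__}")
        elif not (lo <= value <= hi):
            errors.append(f"{name}: {value} outside [{lo}, {hi}]")
    return errors

ok = validate({"popSize": 1000, "eirAnnual": 20.0, "simulationYears": 5})
bad = validate({"popSize": 1000, "eirAnnual": -1.0})
```

Because the interface works only from such schemas, adding a new disease model reduces to registering its schema rather than writing model-specific interface code.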

The JSS infrastructure consists of components for simulation model software, analysis software, computational resources, interfaces, and digital libraries. Each component uses brokers to communicate with the other components through the centralized blackboard. Entities may be added to individual components and directly utilized by the rest of the infrastructure. As an example, a new simulation model may be added to the model component and will automatically be exposed through the user interface, executed on compute resources, and managed by the digital library.

A centralized blackboard and component brokers manage the communication of messages and data in the JSS infrastructure. In addition to passing messages between the interface and digital library, the infrastructure also automates the experimentation workflow. This approach is based on existing systems outside of the public health domain that compose models into workflows [33, 99].

After receiving a simulation request, the brokering system is used to locate an available computational system with the appropriate simulation software. The input files are transferred to the identified system and the simulations are launched. Progress updates are sent back to the user interface as individual simulation runs are completed, for progress monitoring purposes. After the simulation process has finished, results are transferred to result repositories for archival and forwarded to the analysis system. The raw result data is used by the analysis broker to invoke analysis software that has been pre-defined to support a particular simulation model. Modelers and policy makers generally do not have extensive technical background or training. The infrastructure's automated workflow removes the need for novice users to fix formatting errors, transfer data to remote machines, and execute command line applications and analyses.
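
The blackboard-mediated flow above can be sketched as a small event queue with registered brokers. The event names and broker behaviors are illustrative, not Simfrastructure's actual protocol.

```python
# A minimal blackboard sketch: components post events, and registered brokers
# claim and process them. Event names and payloads are hypothetical.
from collections import deque

class Blackboard:
    def __init__(self):
        self.queue = deque()
        self.log = []

    def post(self, event, payload):
        self.queue.append((event, payload))

    def dispatch(self, handlers):
        """Deliver each posted event to the broker registered for it."""
        while self.queue:
            event, payload = self.queue.popleft()
            self.log.append(event)
            handlers[event](self, payload)

def execution_broker(bb, payload):
    # Locate a compute system, run the simulations, then post the raw results.
    bb.post("results_ready", {"study": payload["study"], "rows": 128})

def analysis_broker(bb, payload):
    # Invoke the analysis scripts pre-defined for this simulation model.
    payload["summary"] = "analysis complete"

bb = Blackboard()
bb.post("simulation_request", {"study": "malaria-baseline"})
bb.dispatch({"simulation_request": execution_broker,
             "results_ready": analysis_broker})
```

The key design property is that brokers only see events, so new components (e.g., a DL broker that archives every payload) can be attached without changing the others.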

SimDL was designed to support the experimentation workflow for multiple models, including [137], within a reusable, model-independent design. Fig. 3.1 displays the interface for gathering input parameters for a malaria model. Collaborating researchers had developed simulation systems that were complex for non-model-developers to execute. The ontologies were found to aid in a model's UI presentation through the generic browser. A model-independent UI was built to rapidly expose new disease models and present input, result, study design, analysis, and document collections. Brokers were also developed to support the UI in submitting input configurations, compliant with XML Schemas, to available computational systems.

DL Description

SimDL provides services including harvesting, indexing, searching, filtering, and archiving. Through blackboard connections, SimDL receives schemas, input configurations, sub-configurations, and study designs as they are saved by users. The content managed by SimDL resides in databases and file systems on computational platforms. The five types of simulation content produced by the simulation workflow form the five collections managed by SimDL (e.g., schemas that describe simulation models). In addition, SimDL manages the full-factorial or sampling study designs.


Figure 3.1: Overview of a model schema (left), input terms for a selected level of the schema (foreground), and an input configuration file being generated (right).

Sets of input parameters, or input sub-configurations, may be saved to the digital library. This allows domain experts in one field to contribute high-quality blocks of parameters that may be reused by novices without training in all areas of the modeling domain. As an example, malaria models require sets of parameters for human populations, mosquito vectors, disease characteristics, monitoring processes, and public health interventions. Thus, entomologists, social scientists, and local health officials may collaborate asynchronously in designing high-quality studies.

A study consists of multiple full-factorial experiments that include a base file, multiple parameters to investigate, a list of values for each parameter, and a list of previous sub-configuration input values to reuse. After a study design has been developed, the interface verifies the correctness of the parameter values, generates the input files, and submits a simulation request to the blackboard system. Additional components receive the simulation request and automate the process of producing results and generating the analyses and plots applicable to the model schema being used. The interface also provides a central portal for accessing the results and automated analyses.
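
The full-factorial expansion described above is essentially a Cartesian product over the swept parameter lists, applied on top of the base file. The parameter names below are illustrative.

```python
# A sketch of full-factorial study expansion: a base configuration plus
# per-parameter value lists yields one input configuration per combination.
from itertools import product

def expand_study(base, sweep):
    """Yield one input configuration per combination of swept values."""
    names = sorted(sweep)
    for values in product(*(sweep[n] for n in names)):
        config = dict(base)             # start from the base file's settings
        config.update(zip(names, values))
        yield config

base = {"popSize": 10_000, "seed": 1}   # hypothetical reused sub-configuration
sweep = {"coverage": [0.4, 0.6, 0.8], "eirAnnual": [5.0, 50.0]}
configs = list(expand_study(base, sweep))
print(len(configs))  # 3 * 2 = 6 scenarios
```

Each generated configuration then becomes one simulation request, so the interface can both verify values before expansion and report how many runs a study will cost.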


As all of the information displayed in the interface is provided by the blackboard, generic or diverse customized interfaces may be accessed through different URLs and connected with the infrastructure. Lastly, language support for internationalization efforts is provided by a dictionary lookup for all interface terms and labels.

3.3.3 EpiSimdemics and Simfrastructure DL

The simulation applications showcased in earlier sections, i.e., EpiSimdemics [9] and EpiFast [14], are well suited for support by SimDL. These applications are already integrated with the Simfrastructure architecture and brokering system. As SimDL has been integrated with the same architecture through a DL broker, supporting these applications is a logical extension of the existing efforts in producing SimDL case studies. In the Simfrastructure environment, SimDL can provide services to store, manage, retrieve, and compare the sequence of digital objects produced through simulations. It also provides services such as searching, memoizing, incentivizing, reusing, reproducing, and discovering. Providing these information services would require replacing existing execution brokers with DL-extended versions. The modified simulation workflow invokes appropriate SimDL services when processing certain events in each simulation workflow stage.

3.4 Datasets

The initial candidate datasets to be stored in SimDL center around epidemiological experiments related to the previously discussed models. This requires storage of the shared and individual simulation inputs as well as the entire set of outputs for each individual simulation. Going forward, for any simulations conducted through SimDL with Simfrastructure (the infrastructure supporting the Simdemics simulation software), metadata are captured by prompting the user for a context-specific module of the specific scenario being simulated. A future version of SimDL may provide a connection interface to link any external experimental simulation software system repository to SimDL, but this generality is not pursued in the initial implementation. Stored datasets are investigated for indexing and to extract required metadata values. The location of a compressed derivative of each file in the dataset is then stored by a Simfrastructure data manager and linked with SimDL. On retrieval, each compressed file may be individually referenced and returned from the identified location. Computer-generated experimental data are already in a digital format and make for suitable digital objects. This eliminates the need for digitization, which often is a large part of digital library efforts. As the output of experiments will likely be quite comprehensive, the data may have to be compressed in some format. Some data, such as runtime system variables or artifacts, could possibly be curated out of storage if it can be determined that they will not be used or referenced at a later date. Since the data is produced through a computing system, it is likely already in some structured organization of output; see Fig. 3.2 for a digital object representation. The information involved in scientific research processes may be formed into a structured stream of content. This stream is based on the research workflow that produces these digital objects; see Fig. 3.3.

Figure 3.2: Digital object model.

Figure 3.3: Structured workflow of content as produced by the research environments that SimDL supports.
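
The compressed-derivative scheme described above (compress each output file, record its location, return it individually on retrieval) can be sketched with the standard library. The catalog layout and identifiers are illustrative, not the Simfrastructure data manager's actual format.

```python
# A sketch of compressed-derivative storage: each output file is gzipped and
# its location recorded so it can be individually retrieved later.
# The catalog layout and dataset identifiers are hypothetical.
import gzip, os, tempfile

catalog = {}  # dataset id -> path of its compressed derivative

def archive(dataset_id, data: bytes, store_dir):
    """Compress one output file and record where the derivative lives."""
    path = os.path.join(store_dir, dataset_id + ".gz")
    with gzip.open(path, "wb") as f:
        f.write(data)
    catalog[dataset_id] = path
    return path

def retrieve(dataset_id) -> bytes:
    """Return the decompressed bytes for a previously archived dataset."""
    with gzip.open(catalog[dataset_id], "rb") as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    archive("run-0001-output", b"time,infected\n0,10\n1,14\n", d)
    assert retrieve("run-0001-output").startswith(b"time,infected")
```

Keeping only the compressed derivative plus a location record is what lets curation decisions operate per file rather than per study.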

3.5 Metadata

Structured data resulting from an experiment can be stored cohesively as a single complex digital object (DO) file with attached metadata. This metadata partially takes the form of the system information necessary to rerun the experiment as well as additional variables that are useful in robust searching. To be specific, the metadata must capture information on the computing hardware the experiments were run on. In addition, information on the software used to run the experiment and the input parameters to the software system are necessary to conduct and differentiate specific experiments. Since various types of experiments are to be stored within the same digital library, the metadata must be sufficiently expressive to store all of the possible experiments. Information related to the hardware system may be common across numerous experiment sets. Thus, it is primarily the software and input information that must be carefully constructed, though hardware may affect the results in some distributed agent-based systems. This metadata section provides enough information for a simulation scenario to be precisely duplicated a second time. Additional metadata is required to allow search on variables not included with the input or environment variables. To promote insightful search, context-specific metadata is used to capture scenario-specific information.
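
The duplication-oriented record described above (hardware, software, and input information, plus search-only variables) can be sketched as a simple structure. The field names and example values are illustrative, not SimDL's actual record schema.

```python
# A sketch of reproducibility metadata: enough hardware, software, and input
# information to rerun a scenario exactly, plus search-only variables.
# Field names and example values are hypothetical.
from dataclasses import dataclass, field, asdict

@dataclass
class ExperimentMetadata:
    model: str                 # simulation software used for the runs
    model_version: str         # needed to duplicate the scenario precisely
    hardware: str              # compute platform the experiments ran on
    inputs: dict               # full input parameter set for the scenario
    search_terms: dict = field(default_factory=dict)  # search-only variables

record = ExperimentMetadata(
    model="EpiFast", model_version="2.1",
    hardware="x86_64 cluster, 96 nodes",
    inputs={"region": "VA", "r0": 1.8, "replicates": 25},
    search_terms={"intervention": "school closure"},
)
flat = asdict(record)  # serializable form for an index or triplestore
```

The `inputs` portion supports exact rerun; the `search_terms` portion exists only to make the record findable and is never consumed by the simulation engine.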

An agent-based discrete event simulator, such as in Simdemics-based systems, may be capable of running simulations in numerous contexts covering varying phenomena. For example, Simdemics is capable of modeling public health epidemics [4] as well as the spread of malware in wireless networks [25]. It would be undesirable to maintain separate digital library collections for each context and equally undesirable to force incompatible contexts into the same space. To maintain numerous context-specific experiments within the same digital library, a context-specific module is required to allow management of various contexts. This module contains the entire set of metadata fields associated with a simulation scenario, including the input and environment variables. This metadata module contains significant variables that are extracted from the input and output files. These variables are defined within the context-specific module by the user group with domain knowledge of the specific context. After the selection of input and output variables, additional context-specific and context-free variables related to the contextual scenario are incorporated into the metadata. These additional variables are primarily used to enhance the search functionality and are not used for experiment duplication. For known and described modules, SimDL is able to store context-specific cells and provide search functionality over the identified input and output variables.

The raw content generated by large-scale infrastructures is intractable to support in real time; metadata plays a key role in providing DL services. Producing metadata records covering content is done through automated metadata indexing in external data sources. The metadata indexes used for high-level services are all contained in RDF triplestores. Low-level services utilize an RDBMS for metadata representation. While metadata record terms are highly dependent on the content being represented, I have defined a four-part metadata schema structure to be used for all metadata records. The standard record structure contains Dublin Core terms, infrastructure terms, and formal terms based on formal definitions.
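
Metadata of the kind held in those RDF triplestores can be sketched as subject-predicate-object triples with pattern matching. This uses a plain in-memory set rather than a real triplestore, and the example terms (`dc:creator`, `sim:model`, the run identifiers) are illustrative.

```python
# A sketch of triple-based metadata indexing: records are (s, p, o) triples
# and queries are patterns with wildcards. An in-memory stand-in for a real
# RDF triplestore; the predicate names and run IDs are hypothetical.
triples = set()

def add(subject, predicate, obj):
    triples.add((subject, predicate, obj))

def query(subject=None, predicate=None, obj=None):
    """Match triples against an (s, p, o) pattern; None is a wildcard."""
    return [(s, p, o) for (s, p, o) in triples
            if subject in (None, s)
            and predicate in (None, p)
            and obj in (None, o)]

add("run:42", "dc:creator", "leidig")
add("run:42", "sim:model", "EpiFast")
add("run:43", "sim:model", "EpiSimdemics")

models = query(predicate="sim:model")   # which model produced each run
run42 = query(subject="run:42")         # all metadata for one run
```

Triple patterns of this shape are what make high-level services (search, curation rules, incentivization reports) expressible without knowing each collection's record layout in advance.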

By combining automatically harvested and user-supplied metadata, SimDL is able to provide search functionality over a context with a designer-defined amount of metadata storage. Significant input and output variables may be searched to match a user query with existing experiment sets. Over time, a module designer may need to alter the module design and modify the module scheme. With this approach, context-specific significant variables may be stored twice, as they appear in the digital objects as well as in the metadata. After the metadata module development process, a given metadata scheme is capable of 'indexing' all data files related to a cell with the significant input, output, and user-supplied variables for that cell's contextual scenario.


3.6 Experiments and Studies

As published in [6], computational epidemiology experiments were conducted to parameterize a simulation application and undertake predictive scenario studies. The effort was in the context of public health interventions over the course of an epidemic (e.g., anti-viral distributions and school closures). My efforts in this large study led to the identification and specification of the digital library services needed to support this community.

3.7 Need for SimDL

A digital library supporting scientific domains must manage a sequence of content including large social network files, complex input configurations, geo-temporal output results of disease spreads, statistical plots, publications, and geographical diffusion video clips. Simulation groups in academia and national laboratories, using cluster, grid, and cloud resources, generally produce terabytes of simulation results when conducting studies in a variety of simulation domains. Their epidemiological research environments produce massive quantities of scientific, highly numeric digital objects. The sequence of content scales from several gigabytes for small studies to terabytes for a large-scale national study. However, traditional DL services do not support simulation tasks. Data management is needed for content including public health research, medical and census records, synthetic populations, graphs and networks, and simulation workflows. Service algorithms and implementations cannot rely on previous work in developing full-text indexing and algorithms that utilize term frequencies, thesauri, dictionaries, and partial-matching scores. Simulation efforts in HPC often involve multiple experts from different domains and locations collaborating on developing models, software, and experimental studies. An average modeling and simulation group might include computational experts, software developers, statisticians, mathematicians, medical or disease experts, entomologists, and public health officials. Only a select group of individuals will be able to connect to HPC resources, transfer data to necessary locations, design a correctly formatted input set, invoke the simulation model executable, and perform analyses. Web interfaces and DL-automated human-intensive processes assist in exposing simulation systems to non-technical users but often only provide a restricted subset of available simulation features and possible study designs.
Modeling and simulation research groups are commonly hindered by the lack of sufficient digital library systems for simulation workflows and numerical research content. These groups frequently do not have personnel with expertise in data management and information retrieval. Non-uniform content management practices have hindered collaborations between institutions and failed to support cohesive efforts within labs.

As seen from the preceding examples and highlighted publications, large simulation systems, such as Simdemics, are capable of executing large, complex studies. Simulation models require large datasets (e.g., human population networks at the continental level), HPC systems (e.g., grids and clusters), and large storage capacity for the generated content. Petascale storage systems are in place for a variety of domains including the Internet Archive, scientific research centers such as CERN, and national laboratories. As petascale simulation applications become prevalent, scalable management and retrieval methods are required by research groups without prior experience or expertise in designing and implementing systems for this context. The parameterization of study scenarios is often a drawn-out, human-intensive process. Executing an array of scenarios with enough replications to achieve statistical significance often requires a non-trivial, costly amount of HPC resources. Research organizations and individual researchers generally do not have the desire to manage the entire workflow of data products systematically. Given the current set of available data management solutions, managing complex objects remains a difficult process. Organizations also face the problem that too much data is generated to keep everything. Consistent curation is required at the institutional level, whereas many organizations rely on arbitrary deletion decisions by individual data producers. The current class of digital library solutions does not support the information-product preservation and repurposing functions required by large-scale simulation environments. SimDL is able to manage experiments and related content. The experimentation design process is streamlined by reusing portions of the existing corpus of input configurations. Backend processes may be automated with several SimDL services, e.g., memoization (reusing instead of recomputing existing results). Curation recommendations are provided based on rules parameterized for each collection, suggesting what should be curated, what curation activity should be performed, and quantitative support for why the recommendation was made. Incentivization reports are generated for contributors of software, content, resources, and human-intensive efforts in multiple user roles. With this set of minimal services, SimDL is the first digital library system capable of supporting large integrated research environments.

As published in [87], SimDL aims to provide a generic, extensible, component-based digital library. SimDL is customized by tool builders through domain-specific simulation schemas and extended through the addition of modular components. The utilization of schemas allows SimDL instances to support simulation applications for an unrestricted set of domains and contexts that have structured input requirements and a waiting simulation-launching broker. Non-simulation scientific content may be handled by SimDL instances without the simulation submission component (e.g., wet labs or instruments producing digital data points along with environmental input conditions from high-throughput methods).

SimDL was designed out of a need to automate the holistic management of scientific data produced by multiple simulation applications at multiple institutes. Modelers and simulation developers also desired fully automated deployment of a basic DL instance and a low amount of continued maintenance. The management of scientific data requires policy decisions for data preservation, curation, and distribution mechanisms within the digital library. Shared access to a digital library hosted by one institution allows privileged experts at multiple locations to directly and indirectly collaborate, communicate, cooperate, and coordinate research efforts through the DL. The automation of conducting simulation studies is accomplished by the integration of SimDL into an existing simulation infrastructure. Until scientific digital libraries become mainstream in the simulation community, DLs will likely continue to be integrated into pre-established experimentation processes and existing software systems. Customization of the simulation-launching component may be required to provide an existing simulation engine with a functional set of inputs (e.g., transmitting XML input files to a waiting computation request broker).

3.8 Simulation and Experiment Supporting Digital Libraries

As published in [87], my efforts have produced SimDL instances and services in several fields. Many institutions and research groups across a broad spectrum of domains make use of simulation applications to produce experimental data. Simulation efforts may benefit from employing a generic, extensible digital library that makes use of a domain-descriptive schema to manage content related to the experimentation process. Simulation applications in the field of computational epidemiology are initially targeted due to interest by research institutions in this domain. The domain schema containing metadata concerning the required input parameters to the simulation application may be used to customize a generic digital library, interface, and data management process. Multiple schemas are likely necessary to represent the domain semantics of a simulation system for several contexts. The use of schemas to represent contextual information, such as input parameters, allows simulation-supporting digital libraries to cover domains in addition to computational epidemiology. While data management practices for simulation-related content vary greatly between domains and research groups, a generic digital library permits extensive reuse of prior DL development efforts to support structured simulation systems in a non-restricted set of eScience domains.

Efforts in developing SimDL aim to provide a generic management system for simulation groups that can be tailored to support individual simulation models [87, 88]. SimDL provides a generic component-based framework with services that are customized for a set of simulation models through the use of model ontologies. The scope of SimDL is to provide a DL that may be integrated in HPC cyberinfrastructure architectures or an individual lab, with services for epidemiology simulation study management, organization, and search. In designing SimDL services for epidemiology, we have developed a scheme in which many of the services are likely to be useful for other classes of simulations.

Digital libraries can be integrated with public health research infrastructures to support this community. DLs can be used to manage existing simulation inputs and results to allow full or partial reuse of existing input configurations and results. Workflows can be developed within an infrastructure using SimDL to capture content as it is produced at each simulation workflow stage. Processes can be automated through a DL for tasks such as archiving existing input configurations, mapping inputs to result files, managing results to be sent to analysis scripts, generating metadata description records for documents, notifying users of events (e.g., the availability of results or analyses), and reusing content. DLs also serve as a logical space for organizing content based on ontologies and for distributing results, analyses, publications, and related documents.

Currently, there is a notable lack of deployable DL options for research institutions. Existing DLs lack the capabilities to communicate with cyberinfrastructure components, provide numeric-based service implementations, automatically create metadata records, support simulation tasks, and federate across simulation systems, models, and model versions. My goal in developing SimDL was to generate a formal, reusable, component-based DL framework to be utilized by simulation groups in generating DL software and instances. The focus of this work is on automatically managing scientific content while enabling scalable discovery and retrieval. As published in [92], SimDL provides at least the minimal set of functionality required by computational epidemiology research groups. Formal definitions provide a foundation for the SimDL class of digital libraries. The formal definitions guided the development of SimDL software and instances. The definitions allow DL services to be implemented and reused between systems and instances.

SimDL provides a complete digital library with novel services and implementations to directly support the epidemiology and network science research communities. The process for constructing SimDL included formally defining content, services, users, and user tasks; identifying the minimal set of simulation-specific services based on user modeling; implementing simulation-specific services; defining a framework for composing generic and simulation-specific services; developing multiple instantiations of the framework for a cyberinfrastructure and a local institutional infrastructure; and evaluating the effectiveness of the set of functionality provided by the framework. A formalized description of SimDL and simulation content was presented in [87]. Formal definitions of simulation-specific services have been formulated, implemented in software, and evaluated. These service definitions describe the minimal set of user tasks required by simulation-supporting DLs. The definitions are used to describe interactions between users and the infrastructure as well as patterns of tasks. A broader set of potential service components to include in SimDL is shown in Fig. 3.4. However, efforts to produce SimDL have concentrated on identifying, defining, and implementing the minimum required functionality.

Installations of the SimDL framework and software are currently in place for (1) a cyberinfrastructure for network science applications including epidemiology, (2) two public health research institutions, and (3) several simulation models. A specific cyberinfrastructure, CINET, and several local infrastructures at research institutions provide the initial set of environments for SimDL to support. These environments were selected as representative samples of possible use cases of SimDL. A brokering system is used for SimDL communication with other infrastructure components. Evaluations of the sufficiency of formally-defined services, effectiveness at supporting user tasks, and performance have been conducted.


Figure 3.4: Broader context of services and system components in SimDL.


Chapter 4

Digital Library Foundations

Many digital library systems have been developed since around 1991. The digital library community and DL projects generally do not utilize existing underlying foundations or theories that have been found useful in other domains, e.g., databases.

Minimal DLs are expected to provide a set of DL services that meet the anticipated scenarios of a society in a given context. User societies in SimDL are composed of computational and social scientists interested in computational epidemiology and network science for purposes ranging from the simulation of network dynamics to the refinement of public health policies. Metadata structures and provenance tuples link sequences of input data, simulation results, network analyses, generated reports, multimedia files, and publications. Archives of simulation-related content with related DL services hold the promise of changing the way studies are conducted in simulation-based fields. The SimDL framework has been previously described in [87, 88], while some services were defined in [117].

Previous research efforts in digital library foundations have produced two digital library reference models, the 5S framework [52] and the DELOS Reference Model [24]. The 5S framework formally defines DL-related terms using precise tuples [53]. The 5S framework consists of Societies, Scenarios, Spaces, Structures, and Streams. Various aspects of a DL, its content, its users, and its services can be formally defined using combinations of each of the 5S terms. In the 5S framework, a society is a high-level component of a DL that describes the information needs of users. It defines the administrators who are responsible for DL services, the actors who use those services, and the relationships between them. An example of a society in a DL is the set of service managers who are responsible for managing the services provided by a DL. Scenarios detail the behavior of DL services. Scenarios may be used to describe the external DL behavior from the user's point of view, e.g., searching for documents by type in a DL. Spaces define logical and presentational views of DL components, e.g., vector space models used to rank documents related to queries. A structure specifies the organizational aspects that are used in organizing DL content, e.g., metadata specification representations. Streams describe the communication of information within the DL, e.g., bytestreams of video materials. The 5S framework was formulated with formal descriptions of core DL concepts. The 5S framework has been used to generate multiple digital libraries including NDLTD, ETANA, and AlgoViz, as described in [43, 45, 54, 136]. Additional research in 5S has focused on quality evaluations (5SQual) [109], a descriptive language (5SL) [51], and graphical representations (5SGraph) [162]. The framework has since been extended through the addition of subsequent definitions tailored to describe aspects of digital libraries within a particular scope, e.g., content-based image retrieval [111]. Common services required by many DLs involve indexing, searching, and browsing content as well as query and annotation processes [52]. Producing formal descriptions of digital libraries is facilitated through reuse of the existing set of definitions.

The DELOS Reference Model is a DL foundations effort that parallels the 5S framework. Aspects of a digital library that can be formally defined in 5S can be represented by one of the sections of the reference model. The reference model consists of six high-level terms: content, functionality, user, quality, policy, and architecture. The DELOS reference model describes the digital library universe in terms of the digital library management system (DLMS), digital library system (DLS), and digital library (DL). A DL instance is a practically running system that is hosted to allow societies to interact with content. The DLS is the software implementation used to deploy a class of DLs. The DLMS is the representation of the conceptual problem, solution, and DL approach, possibly expressed in formal definitions. These definitions include content, users, functionality, architecture, and policy. The DELOS reference model may be used to define content and services as well as to define relations between terms in concept maps of the DL universe [24]. Furthermore, the DELOS functionality working group has researched methods of defining compositions of functions, an essential task in defining the minimal set of SimDL services.

The two models provide frameworks for describing a DL and all of its aspects. In this work, we use the 5S framework to encode formal definitions for a SimDL DLMS; see Fig. 4.1. The 5S framework is utilized to precisely describe DL aspects. The DELOS model is utilized for a separation of the concerns associated with formal definitions, DL software, and DL instances. The DELOS model's separation of people and services into DLMS, DLS, and DL concepts is one of many potential societal organizations. In addition to formal definitions, semantics, e.g., ontologies and taxonomies, describe the functionality and policies in SimDL. Different societal arrangements of the definitions and semantics would lead to different SimDL architectures. We extend the use of the 5S framework to support services based on 5S definitions of content and users. We made use of the DELOS universe concept to guide our formalization effort. The software DLS and SimDL instances (DLs) are then produced from these definitions. SimDL may be described in terms of each of these system classes. SimDL as a 'DLS' has been provisioned in a software toolkit. Its application as a 'DL' includes active deployment in epidemiology and network science infrastructures. The innovative aspects of SimDL include its conception as a DLMS. DLMS aspects include its users, functionality requirements, content definitions, service definitions, service design, automated policies, and architectural design. Our goal was to fully design SimDL as a DLMS, create a usable DLS software implementation of the DLMS design, and deploy the DLS in two practical DL prototypes. The following sections will provide formal definitions for the content, users, and services of SimDL.

Figure 4.1: Conceptual simulation digital library universe based on the DELOS Reference Model using 5S-encoded formal definitions.

The end-users, designers, system administrators, and application developers described in DELOS can be mapped to formally defined societies in 5S. Similarly, functionality is mapped to services. The other top-level terms are mapped to combinations of 5S concepts, e.g., content is mapped to streams and structures. The following examples demonstrate the potential for mapping from one of the formal frameworks to the other. The service manager user shown in the society concept of the 5S framework can be defined in the 'user' concept in the DELOS model. The user concept covers the various actors that interact with a DL. The example of searching for documents by type shown in the scenarios concept in the 5S framework can be defined in the 'functionality' concept in the DELOS model. The functionality concept defines the function as a specific querying service that a DL offers to its users. The vector space model described in the spaces concept of the 5S framework can be defined in the 'content' and 'architecture' concepts of the DELOS model. The vector space model represents content, including the metadata that the DL handles, and may be utilized in architectures for making data available to users. The example of metadata specifications shown in the structures concept in the 5S framework can be defined as a type of structure defining 'content' in the DELOS model. The streams of video materials in the stream example can be defined in the 'content' concept in the DELOS reference model. Thus, it is possible to 'translate' our formal definitions into the DELOS framework, albeit with a loss of the mathematical precision found in the 5S framework. For this work, precise formal definitions were desired to guide software implementations, reason over service compositions, and produce a service registry. In other contexts, general definitions may be desired to allow software developers leeway in the behavior and implementation of services.

By designing and implementing SimDL as a DLMS and DLS, we are producing a practical toolkit based on formal theory and abstractions. The DLMS toolkit provides a solution to the well-defined class of problems described earlier. The DLMS design and provided DLS implementation may be used by simulation groups to deploy DLS instances within simulation workflows. Modifications to the DLS will be required by simulation groups to utilize site-specific databases, support infrastructure components, add functionality, and optimize service implementations.

One innovation of this work lies in SimDL as a DLMS and the functionality it provides to simulation infrastructures. Many of SimDL's services are found in available digital libraries. However, existing service implementations are not suitable for simulation content, tasks, and infrastructures, as they were motivated by a fundamentally conflicting context. Examples of these services include logging content use for incentivizing, comparing numerical studies, and searching scientific collections. Other SimDL services solve new problems that were not previously envisioned in traditional DL efforts. These services include incentivizing, memoization, semi-automated metadata description set generation, metadata extraction, metadata record generation, and record indexing. This effort is aimed at advancing computation-based reproducible science. The benefits it provides to simulation-based research include verifiable results and claims, replicable experimentation, provenance tracking, managed collections, practical services, novel science-specific service implementations, and usage reporting for content and software.
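Of the services listed above, memoization is the most directly expressible in code. The following is a hedged sketch of the idea only (the cache keying, class name, and stand-in simulation function are my illustrative assumptions, not SimDL's implementation): before launching a run, check whether an identical input configuration has already produced archived results.

```python
# Hedged sketch of the memoization service idea: before launching a run, check
# whether an identical input configuration already has archived results.
# The cache keying and the stand-in simulation are illustrative assumptions.
import json

def config_key(icfg):
    """Canonical key for an input configuration."""
    return json.dumps(icfg, sort_keys=True)

class MemoizedRunner:
    def __init__(self, run_simulation):
        self.run_simulation = run_simulation
        self.store = {}   # config key -> archived result
        self.hits = 0     # count of reused prior results

    def results_for(self, icfg):
        key = config_key(icfg)
        if key in self.store:
            self.hits += 1          # identical study already executed
        else:
            self.store[key] = self.run_simulation(icfg)
        return self.store[key]

runner = MemoizedRunner(lambda icfg: {"attack_rate": 0.31})  # stand-in model
runner.results_for({"R0": 1.8, "region": "VA"})
runner.results_for({"R0": 1.8, "region": "VA"})  # served from the archive
print(runner.hits)  # 1
```

For HPC-scale simulations, avoiding even one duplicate run can save far more than the cost of the lookup, which is the reproducibility-plus-economy argument behind the service.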

Precise formal definitions of DL services are paired with additional informal descriptions. Service descriptions include actors, activities, components, socio-economic and legal issues, and environments. These descriptions are constructed based on the taxonomy of DL terms provided in [50]. The inputs, outputs, pre-conditions, post-conditions, and implementation details for each service are provided, along with relational operators. 5SL, the language used for formal definitions within the 5S framework, provides a means of encoding the metamodel for the minimal simulation-supporting DL. The metamodel describes the architectural component of the DLMS; see Fig. 4.2. The preceding references and the previous body of work in related DL projects support my selection of the 5S framework for the purposes of formalizing the SimDL framework.

4.1 Formal Definitions for Scientific Contexts

Digital libraries have recently been applied in a diverse set of scientific domains requiring sets of services distinguishable from those of traditional domains. Providing suitable coverage and implementation for scientific digital library services remains a time-consuming effort. Ad-hoc, institute-specific service implementations introduce the possibility of incompatibility, reduced interoperability, and redundant effort. Formal frameworks provide a method to formalize the design, structure, and implementation details for required services and existing reusable implementations. I have formally defined scientific content, users, and services for simulation-related domains. Definitions are used to guide service development through a generic scientific digital library toolkit. The definitions and toolkit support DL generation and evaluation for specific domains.

Figure 4.2: Metamodel of SimDL's architectural design.

Current scientific digital library efforts are increasing the functionality available to the scientific community through domain-specific implementations [18, 38, 78, 96, 160]. As digital libraries are further expanded across academic institutions and scientific domains, coordination will be needed between digital librarians to reduce duplicated effort. The functionality of scientific applications requires DL services not found in existing DLs (e.g., support for fully numeric content and large-scale scientific content providers). Existing DL services require re-implementation to support scientific content due to deficiencies and mismatches between required and existing services (e.g., lack of comparable term frequencies and document frequencies in highly numeric data files). New methods for selecting, managing, and accessing scientific digital objects are required. The community of digital librarians developing digital library systems for scientific institutions would benefit from a formal foundation to precisely define concepts and effectively collaborate while remaining cognizant of domain-specific considerations.

4.1.1 DL Management Systems

In the first step, we apply the 5S notation of formal definitions to describe the DLMS layer; see the inner circle in Fig. 4.1. A proposed DL's content, users, services, and service implementations are defined. The definitions include mathematical notations, input content, output products, pre-requisites, post-requisites, pre-conditions, post-conditions, and implementation details. SimDL required simulation-specific services for storage architectures, browsing and retrieval, content filtering, similarity scoring, incentivizing users, and supporting simulation infrastructures [88]. Formal definitions of simulation users, content, and service requirements were developed early in the DL design process [87].

4.1.2 DL Systems

The DLMS service definitions are used by DL developers to produce the required DLS software toolkit implementation; see the middle layer in Fig. 4.1. We currently use the pairs of service definitions and implementations to generate a service toolkit, seed a potential service registry, identify services pertinent to a required instance, and reuse implementations. We also make use of these definitions in the design of evaluation studies. As an example, formal definitions enable comparisons between scientific DL requirements, services defined in the DLMS, and service coverage provided by the DLS toolkit.
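In its simplest form, such a comparison reduces to a set difference over formally named services. A toy sketch (the service sets below are illustrative, drawn loosely from this chapter; the check itself is mine, not a SimDL mechanism):

```python
# Hedged sketch: checking DLS toolkit coverage against DLMS-defined service
# requirements as a set difference over formally defined service names.
REQUIRED_SERVICES = {"indexing", "searching", "filtering", "logging",
                     "memoization", "incentivizing"}   # hypothetical DLMS set
TOOLKIT_SERVICES = {"indexing", "searching", "filtering", "logging"}  # hypothetical DLS

def coverage_gap(required, provided):
    """Services formally required but not yet implemented by the toolkit."""
    return sorted(required - provided)

print(coverage_gap(REQUIRED_SERVICES, TOOLKIT_SERVICES))  # ['incentivizing', 'memoization']
```

A fuller comparison would match on the formal signatures (inputs, outputs, conditions) rather than names alone, which is what the formal definitions make possible.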

4.1.3 DL Instances

The define-design-develop approach was applied in developing services for the case studies and a fingerprint digital library to support content-based image retrieval and fingerprinting algorithm analysis [117]. See [88, 117] for descriptions of the case study prototypes and formal definitions of the users, content, and services for these scientific digital library systems.

4.1.4 Takeaway

Our work in developing formal definitions, implementations, and formal evaluations of DL instances has direct benefits for scientific DL development efforts. The definitions provide precise implementation requirements for service developers and DL instance builders. Our ongoing efforts are in producing the DLS toolkit and service registry to promote future reuse of service products through their context and implementation details. Formal definitions allow for formal analysis of interoperable and workflow services. This is conducted by reasoning over potential service compositions based on input and output behavior, pre- and post-conditions, etc. The definitions and toolkits provide a starting point for service reuse and generalization in the scientific DL community.

The contribution of this work is a formal specification of the minimal set of services in an epidemiology simulation-supporting DL based on the case study requirements, and a working system, SimDL, with implementations for these services. An overview of the minimal SimDL functionality, as defined in formal definitions, is shown in Fig. 4.3. With formal definitions, a collaborating digital library research community could produce a registry of available, reusable DL services. As seen in the search service definition, formalizations are required to differentiate, group, and form hierarchies between multiple implementations of similar services. The definitions provided in this work are a starting point for formally describing required functionality in a representative sampling of possible scientific DL efforts. The 5S framework was used to describe inputs, outputs, pre-conditions, post-conditions, and the desired functionality of a minimal set of SimDL services. Further efforts are required to produce a standard for formally describing implementations of these services in the 5S notation and to further define the composition of specific function implementations. With the addition of more specific definitions of service input formats, output formats, and implementation algorithms, we envision future efforts towards a DL service registration system based on the 5S framework notation for managing scientific DL services.

Figure 4.3: Definitional dependencies for the minimal simulation-supporting digital library.


Table 4.1: Informal service definitions [52].

Acquiring: Takes a set of digital objects, belonging to a collection not in the DL, and incorporates them into some collection of the DL.

Authoring: Creates a digital object and incorporates it into some collection of the DL.

Cataloging: Incorporates a metadata specification into a set of metadata specifications describing a digital object.

Converting: Takes a digital object and produces a version of it (by changing its streams, structures, or structured streams).

Describing: Produces a description of a digital object (in terms of a metadata specification) and incorporates this description into the object's set of metadata specifications.

Filtering: Given a set of digital objects and either (1) a query, a threshold (a real number), and an index; or (2) a category and a classifier, produces a subset of the original set in which the objects either match the query with a weight higher than the threshold or belong to the specified category.

Indexing: Given a collection, produces an index for it.

Logging: Produces a log entry from some event from a scenario of some service.

Requesting: Given a handle of a digital object, returns the respective object.

Searching: Given a query, a collection, and an index for that collection, returns for each object in the collection a real number indicating how well the query matches with the object.

Submitting: Either incorporates (1) a new object into the collections of the DL; (2) a new metadata specification into the set of metadata specifications of a digital object; or (3) a new operation into the set of operations of a service manager.

4.2 Background Definitions

Previous work in 5S has produced a body of formalisms we reuse to describe SimDL. The development of SimDL builds heavily on previous digital library designs. As such, formal definitions for simulation-supporting digital libraries build greatly upon existing formal definitions for underlying services. Table 4.1 contains an informal description of these existing services and provides a foundation for simulation-specific services. Table 4.2 provides a partial formal definition for each of these basic services with which simulation-specific services interact. See [52] for lower-level symbol definitions and pre- and post-conditions not elaborated here. Each of these services relies on content definitions described in previous work [50], e.g., collections Coll, catalogs Cat, repositories R, digital objects DO, and indexes I. At the lowest level, common symbols are consistent across service definitions, as partially shown in Table 4.3. In scenario or service definitions, several relational operators are reused, e.g., precedes and reuses. Relations between societies are used to describe human and computing users and their associations, e.g., includes, invokes, and associated. Relations between content, societies, and services are also encoded in 5S, e.g., executes, recipient, participates_in, uses, and runs.


Table 4.2: Formal service definitions developed from initial work proposed in [52].

Service      User input                       Other input        Output
Acquiring    {do_i : i ∈ I}, C_i              None               C_k
Authoring    None                             None               do_i
Cataloging   h_i, ms_ik                       (h_i, mss_ip)      (h_i, mss_iq)
Converting   C_i                              None               C_k
Describing   None                             do_i               ms_ik
Filtering    q, t, C_k                        I_Ci, Class_Ct     {do_j : j ∈ J}
Indexing     C_i                              None               I_Ci
Logging      None                             e_i                log_entry_i
Requesting   h_i                              None               do_i
Searching    q, C_i                           I_Ci               {do_k, w_qk}
Submitting   do_i, C_c, ms_ic, DM_Cc, op_k    None               C_k, DM_Cd
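Two rows of Table 4.2 can be made concrete as function stubs. This is a hypothetical rendering for illustration (the 5S symbols are reduced to plain Python types: a collection C maps handles to digital objects, and an index I_C maps terms to handles); it is not SimDL's implementation, which must handle numeric rather than full-text content.

```python
# Hypothetical Python stubs mirroring two rows of Table 4.2, with 5S symbols
# rendered as plain types: collection C_i -> dict of handle -> digital object,
# index I_Ci -> dict of term -> set of handles.

def indexing(collection):
    """Indexing (Table 4.2): given a collection C_i, produce an index I_Ci."""
    index = {}
    for handle, obj in collection.items():
        for term in str(obj.get("stream", "")).split():
            index.setdefault(term.lower(), set()).add(handle)
    return index

def searching(q, collection, index):
    """Searching (Table 4.2): given q, C_i, and I_Ci, return a weight w_qk
    for every object do_k in the collection."""
    terms = q.lower().split()
    return {h: sum(1.0 for t in terms if h in index.get(t, set())) / max(len(terms), 1)
            for h in collection}

coll = {"do1": {"stream": "influenza outbreak model"},
        "do2": {"stream": "network generation tool"}}
weights = searching("influenza model", coll, indexing(coll))
print(weights)  # {'do1': 1.0, 'do2': 0.0}
```

Note how the stub matches the formal signature: searching returns a weight for each object in the collection, exactly the {do_k, w_qk} output in the table.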

Table 4.3: Standard low-level formal symbols as described in [50].

Concept                           Symbol
Digital Object                    do
Collection                        C
Annotation                        ann
Set of annotations                ans
Handle                            h
Binder                            bi
Actor                             ac
Hypertext                         Hyptxt
Metadata Specification            ms
Set of Metadata Specifications    mss
Classifier                        Class_Ct
Category                          c
Cluster                           clu
Transformer                       tfr
Space                             sp
Weight                            w
Index for Collection C            I_C
Query                             q
Stream                            stm
Structure                         st
Structuring Function              ψ_ij
Threshold                         t
Event                             e
Log entry                         log_entry
Rating                            r
Operation                         op
Set of Categories                 Ct


4.3 Content

Scientific and simulation-related content greatly differs from traditional digital objects. A simulation workflow produces multiple stages of simulation content that require data management support. The stages of content include collections for model schemas, model ontologies, user input parameter configurations, output results, output summaries, standard analyses, documentation, annotations, and publications [87]. A digital library managing these collections must support text, binary, data, image, and video files, as well as hypertext.

Research data is not well handled by typical DL services. A single experiment from one user may consist of hundreds of gigabytes of results requiring management as a single digital object. As the largest digital objects begin to grow into the terabyte range for general scientific research, typical policies for distribution, transfer, and curation must be re-thought, as some content may be regenerated at an acceptable computational cost. In massive-scale computing applications, e.g., weather and climate modeling at Argonne National Laboratory, simulations produce petabytes of results per simulation study. Indexing and searching DL services for numerical and graph files are currently inefficient and generally based on previous work in full-text collections. SimDL approaches these problems by using ontologies and metadata harvesting scripts to generate individual model-specific metadata records.
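The harvesting approach can be sketched as follows. This is an illustrative mock-up under my own assumptions (SimDL's actual scripts are model-specific, and every field name below is hypothetical): only fields named by the model ontology are pulled from an input configuration, so the metadata record stays small and searchable even when the raw results are enormous.

```python
# Illustrative sketch of ontology-driven metadata harvesting. The field names
# and paths are hypothetical; SimDL's actual scripts are model-specific.
MODEL_ONTOLOGY = ["model_name", "region", "interventions"]

def harvest_metadata(icfg, result_path):
    """Build a model-specific metadata record linking icfg to its raw output."""
    record = {f: icfg[f] for f in MODEL_ONTOLOGY if f in icfg}
    record["result_location"] = result_path  # a pointer, not the (huge) data itself
    return record

icfg = {"model_name": "seir-network", "region": "New River Valley",
        "interventions": ["vaccination"], "rng_seed": 42}
record = harvest_metadata(icfg, "/archive/study_17/cells.out")
print(sorted(record))  # ['interventions', 'model_name', 'region', 'result_location']
```

Search and browse then operate over these compact records rather than the terabyte-scale numeric outputs, which sidesteps the term-frequency mismatch noted above.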

Figure 4.4: Representation of scientific document workflow. See [87] for additional information regarding the workflow.

Allowing a digital library's users access to an entire provenance stream allows claims to be supported, organizes the existing body of previous work, and allows data mining of collections related to a simulation system. See Fig. 4.4 for an Open Provenance Model representation of the content definitions [108]. The provenance metadata, depicted as directed edges in Fig. 4.4, are tracked in SimDL within the provenance services. The following definitions provide a representation of the content supported by simulation-specific services, as presented in [87]. These definitions build upon the set of previous 5S definitions (i.e., handles, streams, structures, digital objects, complex objects, and annotations).

Definition 1: A schema is a digital object defined as a tuple sch = (h, sm, S) where

1. h ∈ H, where H is a set of universally unique handles (labels);

2. sm is a stream;

3. S is a structure that composes the schema into a specific format (e.g., an XSD structure of elements and attributes with restricted values).

Definition 2: An input configuration specification matching an XSD schema is a tuple icfg = (h, sm, ELE, ATT) where

1. ELE is a set of XSD elements;

2. ATT is a set of XSD attribute values for an element ELE_i.

Definition 3: A sub-configuration, a subset of an input configuration, is a tuple sub-icfg = (h, sm, icfg, ELE, ATT) where

1. icfg conforms to an XSD schema and icfg ⊇ sub-icfg;

2. ELE is a set of XSD elements where ELE ⊆ icfg's ELE;

3. ATT is a set of XSD attribute values for an element in ELE.

Definition 4: A result is an output from a simulation or experimentation process, corresponding to a specific icfg, and is a tuple res = (h, sm, icfg, proc) where

1. icfg is the matching input configuration;

2. proc is the process used to generate the result from icfg.

Definition 5: An analysis of a set of experiments is a complex object consisting of textual or numeric documents and images (e.g., plots and graphs) and is defined as a tuple ana = (h, S_CDO = DO ∪ SM, S, icfg) where

1. DO = {do_1, do_2, ..., do_n}, where do_i is a digital object;

2. SM = {sm_1, sm_2, ..., sm_n} is a set of streams;

3. S is a structure that composes the complex object cdo into its parts in S_CDO;

4. icfg is the input configuration, with a one-to-one mapping to a raw dataset.

Definition 6: An experiment is a complex object consisting of the full range of information constituting an experiment within a domain and is defined as a tuple exp = (h, SM, sch, icfg, ana, D, A) where

1. SM = {sm_1, sm_2, ..., sm_n} is a set of streams;

2. D = {d_1, d_2, ..., d_n} is a set of additional documents, e.g., a summary or publication;

3. A = {an_1, an_2, ..., an_n} is a set of annotations describing the overall experiment and individual digital documents.
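The content definitions above translate naturally into typed records. The following is a sketch rendering Definitions 1-4 as Python dataclasses (Definitions 5 and 6 follow the same pattern); field names mirror the tuples, while the concrete types and the is_valid check are my illustrative reading, not a normative encoding.

```python
# A sketch rendering Definitions 1-4 as dataclasses. Field names mirror the
# tuples (h, sm, S, ELE, ATT, ...); types and the is_valid check are
# illustrative assumptions, not a normative 5S encoding.
from dataclasses import dataclass

@dataclass
class Schema:                       # Definition 1: sch = (h, sm, S)
    h: str                          # universally unique handle
    sm: bytes                       # stream
    S: str                          # structure, e.g., an XSD document

@dataclass
class InputConfiguration:           # Definition 2: icfg = (h, sm, ELE, ATT)
    h: str
    sm: bytes
    ELE: set                        # XSD elements
    ATT: dict                       # attribute values per element

@dataclass
class SubConfiguration:             # Definition 3: sub-icfg = (h, sm, icfg, ELE, ATT)
    h: str
    sm: bytes
    icfg: InputConfiguration
    ELE: set
    ATT: dict

    def is_valid(self):             # clause 2: ELE must be a subset of icfg's ELE
        return self.ELE <= self.icfg.ELE

@dataclass
class Result:                       # Definition 4: res = (h, sm, icfg, proc)
    h: str
    sm: bytes
    icfg: InputConfiguration
    proc: str                       # process that generated the result

icfg = InputConfiguration("h1", b"", {"diffusion", "interventions"}, {})
sub = SubConfiguration("h2", b"", icfg, {"diffusion"}, {})
print(sub.is_valid())  # True
```

Encoding the subset clause of Definition 3 as an executable check illustrates how the formal definitions can guide implementations directly.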

4.4 Societies

User and system modeling has been the focus of research activities in the social aspects of computing. The Zachman framework produces a matrix in which rows represent different people's perspectives on a project and columns represent what they are seeing from that perspective (e.g., data, activities, locations, time, and motivation) [66]. Perspectives are influenced by the activities associated with a user role, e.g., planner, owner, architect, designer, builder, or functioning system. The following communities maintain inter-community relationships through activities as described through a set of community-specific digital library scenarios [87]. These user roles do not include the complete set of stakeholders outside of the DL context, e.g., modelers and system developers who produce new simulation models.

1. Tool builder: formally define a proposed DL; locate, reuse, and assemble existing components; generate novel components; deploy a DL instance; integrate with a simulation or digitized experimentation infrastructure

2. System and DL administrators: set data management policies; clean datasets or metadata; curate content; manage accounts; evaluate existing components

3. Related systems: submission and retrieval with experimentation applications and high-performance computing architectures; analysis of submissions and retrieval with analyst-oriented software; validation of inputs

4. Study designer: generate new model schemas and transform them to updated versions; enter or load input configurations; query, browse, and load sub-configurations; save configurations and sub-configurations; validate configurations; submit configurations for execution; monitor experiment progress


5. Analyst: submit new analysis requests; view automated or requested analyses

6. Annotators: mark annotations on streams of content or individual documents

7. Explorer: query and browse collections or experiment streams of documents

Societies are defined in 5S as a tuple society = (C, R), where C is a set of communities and R is a set of relationships between communities that form around related activities. R = {r1, r2, ..., rm}, where each tuple rj = (ej, ij) pairs a set of communities ej with an activity ij.

4.5 Scenarios and Services

We aim to formally define a suite of DL services to be implemented in SimDL and other simulation management systems. Formal definitions provide a means of explicitly specifying the requirements for services. They allow registered, existing services to be identified and reused in multiple DL implementations. Requirement identification and development may use formal definitions to guide efforts. Definitions may be used to prove the sufficiency and completeness of services without specifying implementation decisions. With formal definitions of services, we envision a future registration system for simulation-specific services described by languages such as UDDI, OWL-S, and the 5S framework. A service registry would be useful in rapidly prototyping DL service layers that support HPC infrastructures and scientific applications as described in [61, 104].

The following set of services is essential for a minimal simulation-supporting digital library. Existing service definitions and notation are reused where possible but are mainly limited to text-based versions of simulation-specific services, e.g., generic indexing and search. See Fig. 4.5 for an overview of the minimally required simulation-specific DL services and broker connections. Several services that are specific to simulation-supporting digital libraries have been defined and implemented; see Table 4.4. These services either do not exist in existing DL software or require implementations that are not compatible with existing software. Each of these services is elaborated with a description, goal, set of scenarios, and formal definition in the following section.

The 5S approach generates tailored DL instances based on metamodels, 5SL definitions, the 5SGen digital library generation suite, and a component pool of existing software; see Fig. 4.6. We have formalized SimDL, extended 5SL to encode SimDL, and generated the component pool and DLS for SimDL manually. This was required because suitable service implementations did not preexist for handling simulation processes. The interaction between services is vital to build a hierarchy of increasingly complex services. The services and definitions provided here will be reusable by future DL implementers through the SimDL toolkit implementation and matching formal definitions.

The class of large infrastructure environments targeted in this work contains metadata databases describing simulation-related content in multiple forms and collections internal to tightly coupled collaborator systems. Similar systems for managing distributed metadata databases would benefit


Figure 4.5: Organization of the minimal simulation-supporting services and communication.

Table 4.4: Informal service definitions for simulation science.

Supporting Metadata: Given a specific simulation process, (1) semi-automates the generation of metadata description sets, (2) describes the methodology to extract metadata values from files, (3) describes the expected metadata record format, (4) automatically processes files as they are generated by a simulation system, (5) automatically produces properly formatted metadata records, and (6) indexes the metadata records in a simulation metadata index and catalog.

Searching in Science: Given a query and an index of metadata covering scientific collections, returns a listing of similarity scores relating the pertinence of each digital object to the query.

Incentivizing: Logs and tracks the usage of content, software, and services and provides reports to producers.

Memoization: Takes an input configuration for a simulation request, identifies duplicate or overlapping simulations, and returns the matching completed simulation results.

Provenance Tracking: Maintains links between related content based on a defined scientific workflow.

Curating: Takes a collection, metadata index, and curation rules and recommends content from the collection and index, as the rules dictate, for archiving, deletion, migration, and preservation.


Figure 4.6: DL modeling and generation architecture guided by DL generation theory provided in [49].

from similar curation services at the infrastructure level; see distributed hash tables and P2P networking from [128]. Previous work in developing SimDL to provide interoperability for multiple infrastructures has uncovered the need to assist individual organizations within the scientific community in automating the design of metadata databases, the capture of metadata for scientific products, and the long-term management of content. As suggested in [158], there is a breadth of potential implementations when allocating responsibilities between a DBMS, the surrounding infrastructure, and applications. Our approach implements automated functionalities at the level of infrastructure services, in contrast to widely practiced, human-intensive database administration (DBA) functions, e.g., indexing and tuning, commonly implemented at the database level.

Digital libraries provide numerous infrastructure-level services such as searching, browsing, archiving, preserving, annotating, and recommending. DLs manage scientific content and make use of metadata databases and indexes to provide these higher-level services. These can make use of existing techniques to improve efficiency, such as index self-tuning [148] and periodic automated index management [63]. Event-based triggers at the infrastructure level may be used for self-healing in terms of disk space faults and for preventing failures, as identified in [158]. Our curation framework produces event-based digital library systems that process infrastructure events (e.g., input file saved, result generated, disk space warning issued) and perform automated infrastructure-level curation services (e.g., generate an input file metadata record, archive results, and remove low-priority content based on existing policies). In cloud computing and large data-intensive simulation systems, the automation of these tasks reduces the effort and time required of database administrators and scientists for tasks tangential to their primary responsibilities.


Formal definitions of each service are provided along with the service descriptions in the 'Simulation-Specific Services' chapter.

Finally, SimDL is a 4-tuple (R, DM, Serv, Soc), where

• R is a repository consisting of scientific content collections (e.g., input configurations and results);

• DM = {DMC1, DMC2, ..., DMCK} is a set of metadata catalogs for the collections {C1, C2, ..., CK} in the repository;

• Serv is a set of services containing simulation-specific services for at least metadata extraction, metadata record indexing, searching, curation, provenance tracking, and incentivizing, as well as the set of services found in the original, non-simulation digital library definition; and

• Soc is a set of societies including administrators, analysts, annotators, explorers, external infrastructural components, study designers, and tool builders.


Chapter 5

Simulation Digital Library Framework

5.1 A Simulation Supporting Digital Library

SimDL aims to fill the gap between commonly available DL software and the requirements of simulation environments. The digital library is intended to be integrated with a larger simulation system and is not distributed as a standalone system; see Fig. 5.1. SimDL is based on a broker system in which multiple external brokers exist for DL communication. As such, SimDL's architecture consists of internal services and broker connection APIs. The digital library broker coordinates the data flow from external infrastructure brokers and provides a content access point for available services. SimDL APIs and brokers allow for component communication and integration within simulation environments. Ontologies describing individual simulation models are used to form a network of related repositories. Repositories with various stages of model-specific content are integrated through links, which form a domain meta-ontology from the individual ontologies. Domain-specific metadata from multiple simulation models are harmonized for inter-model interoperability and search.

SimDL provides common services found in many digital library implementations as well as novel context-specific services. The services provided by SimDL are primarily implemented in the middleware agent, which is supported by the user interface and storage system. The fundamental services in SimDL, and digital libraries in general, are the ability to store and retrieve information. This is supported by input connections to Simfrastructure, the user contribution upload interfaces, and the ability to deposit content to the storage system. Administrators require the ability to modify or manage the collection's state as well as SimDL itself. Search functionality requires an indexing scheme and metadata storage. Metadata is leveraged to provide a higher-level view of simulation data and related files without investigation of each digital object itself. Indexing for a set of experiment scenarios allows for queries over the range of experiments. Third-party access is provided to support extended tools that may leverage SimDL's services. The search functionality commonly provided by current digital library implementations provides the ability


Figure 5.1: Prototypical simulation-based research infrastructural environment.

to search metadata and, in some cases, full text. SimDL requires a new style of search functionality, as widely used standards for full-text document mining and search are not generally relevant to scientific content, with the notable exception of scholarly publications.

The SimDL framework provides DL service abstractions for producing a DLS to fulfill a DLMS design. The service software implementations are modular and capable of building on lower-level services. The metadata schemas utilized by services are implemented as RDF triplestores, allowing arbitrary metadata formats and records to be sent to the DL for storage. External components are able to modify metadata schemas and index or search content without changes to a structured data model or the DL. This also allows multiple projects, workflows, and content types to use the same DL implementation. The DLBroker communication workflow maps service requests to DL services. With this abstraction, multiple DLs tailored to different domains or projects may be connected to the same infrastructure. The exposed services are domain-free. The incentivizing service allows customizations through dynamically set incentivization report and metric templates. The search service utilizes query model arguments for the similarity and ranking aspects of searching. The DL services have a high level of modularity, demonstrated in the dependencies of the higher-level collection building and information satisfaction services on lower-level services. The services can execute independently within this space. The design is useful beyond the CINET and Simfrastructure case studies. The SimDL DLMS and software toolkit are available to be utilized in other contexts with arbitrary data models without changes to the internal metadata indexing scheme.
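The triplestore approach to metadata can be sketched as a store of (subject, predicate, object) statements, which is what lets new metadata terms be added without altering a structured data model. The sketch below is a hypothetical in-memory stand-in, not SimDL's actual RDF backend; the type and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative in-memory triple store: metadata for any schema is held as
// (subject, predicate, object) triples, so new metadata terms can be
// recorded without changing any table layout or class definition.
public class TripleStoreSketch {
    record Triple(String subject, String predicate, String object) {}

    private final List<Triple> triples = new ArrayList<>();

    public void add(String s, String p, String o) {
        triples.add(new Triple(s, p, o));
    }

    // Query by predicate and object; a null argument acts as a wildcard.
    public List<String> subjectsWhere(String predicate, String object) {
        return triples.stream()
                .filter(t -> (predicate == null || t.predicate().equals(predicate))
                          && (object == null || t.object().equals(object)))
                .map(Triple::subject)
                .distinct()
                .toList();
    }
}
```

A production triplestore would add indexing over each position of the triple and a standard query language such as SPARQL; the point here is only that the record format itself never needs migration.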


5.1.1 Computational Scalability

In addition to file storage issues, the computational availability of data management systems is constrained by hard limits. Providing SimDL services will require processing for each user query, as the metadata for the total set of experiments may have to be ranked. In addition to the computational power required to deploy SimDL for a research group, Simdemics will require considerable computation to perform an experiment. Simulation wall time may span a few minutes to several weeks depending on the number of agents, complexity of behavior, range of time steps, and number of experimental runs. Assuming these experiments would be run regardless of the existence of SimDL, the digital library framework may reduce the computational effort when users elect to retrieve existing simulations or when multiple experiments cover the same input ranges as requested for slightly different scenarios.

5.1.2 Scientific Data

Experimental data in the epidemiology and network science contexts require a data model with special considerations. The investigated scenarios are based on the diffusion of a given phenomenon across a sizable dynamic network. In these simulations, agents and behavior may be represented as nodes, and interactions as edges, in a network. This structure provides demographically rich datasets, as each node may contain detailed information on the agent being represented. The network may scale from dozens of computer systems to billions of humans or poultry. These dynamic networks change over time. Some studies may require capturing entire large networks at given points in time. At a minimum, the initial network of agents must be stored. In the CINET case study, a similar type of content exists in the form of large-scale networks from multiple domains. Specific discrete events in epidemiological simulations, such as agent activity or geo-temporal travel behavior, may require storage for each event and agent pair that may occur in simulations. It should be noted that even within epidemiology, the diverse set of known and modeled diseases may require a range of simulation techniques and separate software solutions. For example, simulating a human-to-human airborne disease scenario requires models, inputs, datasets, and disease diffusion that are different from those of a vector-borne scenario. A digital library for epidemiology simulations may require connections to multiple simulation software tools that are individually specialized for respiratory, gastrointestinal, sexually transmitted, contamination, and insect vector-borne disease scenarios. Considering the combined set of scenarios, the multiple disease models, transmission vectors, and possible intervention strategies, it is clearly unsuitable for a single simulation tool to attempt to simulate all possible experiment sets. Although each tool generally simulates the diffusion process of a phenomenon across a network of agents, the process for simulating each scenario may require specialized methods. Due to stochastic variance and the popularly known butterfly effect, the exact inputs to the system are required to provide consistent and deterministic output production. The complete storage of inputs and the resultant simulation outputs provides a rich environment for data mining, content reuse, and repurposing. Fig. 5.2 describes a simplified version of the information workflow of data products as generated in the case study systems.
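The node-and-edge data model described above can be sketched as agents carrying demographic attributes and timestamped contact edges between them. This is a hypothetical illustration of the data shape, not a SimDL or Simdemics type; the names (ContactNetworkSketch, Agent, Contact) are assumptions, and a real system would persist this at far larger scale than an in-memory list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative dynamic-network data model: agents are nodes carrying
// demographic attributes, and each interaction is a timestamped edge
// between two agents.
public class ContactNetworkSketch {
    record Agent(int id, Map<String, String> demographics) {}
    record Contact(int agentA, int agentB, int timeStep) {}

    private final Map<Integer, Agent> agents = new HashMap<>();
    private final List<Contact> contacts = new ArrayList<>();

    public void addAgent(int id, Map<String, String> demographics) {
        agents.put(id, new Agent(id, demographics));
    }

    public void addContact(int a, int b, int timeStep) {
        contacts.add(new Contact(a, b, timeStep));
    }

    // Snapshot of the network at a given time step: only the edges active
    // at that step, as needed when capturing the network state over time.
    public List<Contact> snapshot(int timeStep) {
        return contacts.stream()
                .filter(c -> c.timeStep() == timeStep)
                .toList();
    }
}
```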


Figure 5.2: Information workflow in conducting and archiving an experiment.

5.1.3 Data Integrity

The validation of results depends on the quality of the modeling conducted for the underlying network, agent behavior, and phenomenon diffusion over the network. Unfortunately for researchers, datasets surrounding disease occurrences across the globe are incomplete, and there is no generally accepted dataset standard. To complicate matters, a disease may not have a standard vocabulary for items associated with it. It may be impossible to intelligently compare the validity and quality of results coming from multiple simulation tools that use separate agent and phenomenon datasets. Unlike publications, data are not usually directly evaluated for validity or potential reuse [150]. Within a group, standardized and accepted procedures are typically in place for common data. Those outside of a research group may be granted access to a dataset without knowledge of the dataset's integrity. The dissemination and reuse of scientific data would be quite valuable for long-running and computationally intensive simulations. Metadata models either include or link to input files that contain all simulation parameters. Investigations of the integrity of the selected simulation parameters and underlying datasets would require additional and complementary efforts outside the scope of SimDL. The role of SimDL in this process is to maintain the available metadata relating to data validity and integrity. Integrity information beyond the scope of SimDL (e.g., the generation of synthetic populations) is not captured during the simulation process. However, it may be added by extending the use of the provenance registration services in non-minimal case studies. Descriptions of additional integrity-related information may be added by attaching additional files to experiment sets through the use of related metadata.


5.1.4 Quality of Service

As SimDL is planned for research group deployment, its implementation must address several issues. To fulfill storage requirements, the selected data formats and storage scheme should require as little space as possible while providing input coverage. Storage of content will require a sufficient storage space allocation as well as permanent availability, as provided through the 'data manager' component. To assist in preservation, data backup options are necessary for replication. Data may be compressed with widely available compression tools to limit storage space and file transfer sizes. Network traffic should be limited through the use of compression and the transmission of only the set of requested files, though it is assumed that sufficient bandwidth is available for transmission of large files. The metadata and index structure should be minimized to the level that sufficiently meets the requirements of all services. The metadata selected should be appropriate to context-specific simulation scenarios; thus, various simulation domains and content types will require slightly different metadata schemes. To promote general acceptance of the results and analyses, the context-specific modeling of inputs should adhere to informal institutional guidelines. A bound on system search time is required to ensure usability of the search functions. This computational restriction may culminate in a limit on open user connections at the maximum serviceable amount for the targeted deployment system. This requirement is related to the computational power devoted to performing search over the index and metadata items. Performing large-scale, complex simulations requires a large amount of processing. Large amounts of computational power must be available to ensure that users are not unnecessarily delayed in retrieving datasets relating to dynamically conducted simulations. Should the digital library fall victim to a malicious denial-of-service attack, SimDL must recover and roll back to earlier metadata and index backup files. As the digital library will require continued development and integration with yet-to-be-developed infrastructures, the system design is modular and extensible enough to permit different styles of brokers or wrappers.

5.1.5 Integration and Framework Extensions

A key component of this effort is to link the digital library to the simulation software capable of producing parts of the collection. The boundary between these two independent systems determines which software is responsible for certain functionality. For example, users may use software executables directly and produce simulation sets separate from SimDL. Several Simdemics interfaces have their own implementation of query input matching to identify exact matches to previous simulation runs. Ranking mechanisms are an integral part of a digital library and provide discovery over the corpus of registered content. For experiment sets in Simfrastructure not known to SimDL, a call would be made to rerun a previous experiment, at which point SimDL recognizes an exact match and returns a link to the existing simulation results in SimDL. Regardless of whether new computation is required, SimDL is eventually provided with the required simulation outputs and analyses produced by applications, as well as logs of the users who interacted with them.

Integration is achieved by registering an infrastructure’s blackboard system with SimDL, through


the SimDL properties file. SimDL then monitors this blackboard, receives service requests, and places results back onto the blackboard. Additional modifications are needed in existing external components to redirect existing processes, workflows, and functions in order to send requests to the DL.
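The blackboard integration pattern can be sketched as a polling loop: the broker reads entries addressed to it, services each request, and posts the response back onto the shared board. The Blackboard and Entry types below are hypothetical stand-ins for the infrastructure's actual blackboard system, and the single-cycle pollOnce method is a simplification of what would be a long-running loop.

```java
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch of blackboard-mediated communication: components
// never call each other directly; they post addressed entries to a shared
// board, and the DL broker picks up the ones addressed to it.
public class BlackboardLoopSketch {
    record Entry(String recipient, String request, String payload) {}

    static class Blackboard {
        private final Queue<Entry> entries = new ConcurrentLinkedQueue<>();

        void post(Entry e) { entries.add(e); }

        // Remove and return the first entry addressed to the recipient.
        Optional<Entry> takeFor(String recipient) {
            for (Entry e : entries) {
                if (e.recipient().equals(recipient)) {
                    entries.remove(e);
                    return Optional.of(e);
                }
            }
            return Optional.empty();
        }
    }

    // One polling cycle: read a request addressed to the DL broker,
    // service it, and write the response back onto the board.
    static void pollOnce(Blackboard board) {
        board.takeFor("DLBroker").ifPresent(req ->
            board.post(new Entry("requester", req.request(),
                                 "handled:" + req.payload())));
    }
}
```

The indirection is what allows more than one digital library to service the same blackboard: each broker simply filters for entries addressed to it.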

The framework can be modified through the addition of new services, the modification of existing service implementations, and the parameterization of services. To add a new service, the name of the service needs to be registered in the DLServiceType enumeration (one line of code). The service code itself is added to the code base as a standalone class that extends the DLService interface and provides an execute() method. Lastly, the DLBroker's service selection algorithm must be updated to correctly process requests from the blackboard and forward the request to the new service (two lines of code). Existing service implementations are modified by optimizing the execute() method within the service class. Parameterization of the incentivization and curation services requires additions to the reports, statistical measures, partial Boolean decision clauses, and curation operators. While not elaborated here, the procedure for completing this process is clear in the technical documentation and code. There is currently no formal definition registration service or process.
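The three-step extension procedure above can be sketched as follows. Note this is a simplified stand-in for the actual SimDL code: the enum constants, the MyNewService class, and the dispatch method are hypothetical, and only the shape of the pattern (enum registration, a DLService-style interface with execute(), broker dispatch) follows the description in the text.

```java
// Simplified sketch of the service-extension pattern: register the name,
// add a standalone service class, extend the broker's dispatch logic.
public class ServiceExtensionSketch {

    // Step 1: register the new service name in the service-type enum
    // (one line of code in the actual enumeration).
    enum DLServiceType { SEARCH, CURATE, MEMOIZE, MY_NEW_SERVICE }

    // The service contract: each service exposes a single execute() method.
    interface DLService {
        String execute(String request);
    }

    // Step 2: add the service as a standalone class implementing the
    // service interface; its behavior lives entirely in execute().
    static class MyNewService implements DLService {
        public String execute(String request) {
            return "my-new-service:" + request;
        }
    }

    // Step 3: extend the broker's selection logic so requests of the new
    // type are forwarded to the new service.
    static String dispatch(DLServiceType type, String request) {
        DLService fallback = r -> "unhandled:" + r;  // existing services elided
        DLService service = switch (type) {
            case MY_NEW_SERVICE -> new MyNewService();
            default -> fallback;
        };
        return service.execute(request);
    }
}
```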

5.2 The Software Architectural Overview

The digital library maintains a set of thirteen services that may be invoked directly or through a predefined blackboard and the DLBroker. The digital library code includes a set of services that each utilize the DLDataItem to return generic results back to the DLBroker. There are dependencies between these services. The dependencies generate a multi-tiered digital library, with more complex services building upon lower-level services. See Fig. 5.3 for a depiction of the data flow between classes within SimDL. See Fig. 5.4 for a depiction of how control flow proceeds between SimDL components.

A centralized blackboard and component brokers manage the communication of messages and data in the SimDL infrastructure. In addition to passing messages between system components and the digital library, the infrastructure also automates the experimentation workflow. This approach is based on existing systems outside of the public health domain that compose models into workflows [33, 99]. See Fig. 5.1 for an overview of the style of infrastructure SimDL has been integrated with in each case study.

After receiving a simulation request, the brokering system is used to locate an available computational system with the appropriate simulation software. The input files are transferred to the identified system and the simulations are launched. Progress updates are sent back to the user interface as individual simulation runs are completed, for progress monitoring purposes. After the simulation process has finished, results are transferred to the digital library for archiving and forwarded to the analysis system. Results are directly linked with the input configurations and model schemas used in previous steps. This provenance information could be used by the analysis broker to


Figure 5.3: SimDL internal data flow architecture.

Figure 5.4: SimDL component execution control flow.

invoke analysis software that is pre-defined to support a particular simulation model. Modelers and


policy makers generally do not have extensive technical background or training. The infrastructure's automated workflow removes the need for novice users to fix formatting errors, transfer data to remote machines, and execute command-line applications and analyses.

Through blackboard connections, SimDL manages content such as input configurations and study designs as they are saved by users, results after simulations terminate, and image plots after analysis scripts finish. The digital library broker acts as an adapter between the blackboard and SimDL. This indirection allows other digital libraries and data management systems to be added to the system to complement SimDL. It also allows more than one digital library to service the same blackboard system. The content managed by SimDL resides in databases and file systems on computational platforms. The five types of simulation content produced by the simulation workflow (see Fig. 3.3) form the five collections managed by SimDL (e.g., schemas that describe simulation models). For each digital object in the collections, SimDL automatically generates a metadata record consisting of simulation model-specific terms, collection-level terms, Dublin Core terms [153], and formal 5S framework terms [53]. The metadata records are then able to be used by other services, e.g., to rank content for user searches and execute automated curation policies. Tables 5.1 and 5.2 contain a brief description of each of the SimDL classes.

It is possible for the architecture of SimDL to be described and implemented with various design styles. Since raw data itself, not a high-level summary of results, is involved, a rather large amount of data will likely be stored in each instance. The scope of the digital library will determine what types of experimental results should be stored. Equally important to the scope of experiments being stored is the way in which the data will be used in the future. Ideally, a researcher may be able to save effort by using previous results instead of running new experiments. To accomplish this, the metadata must represent digital object results accurately. A search function will allow the user to find previous experiments that are similar to the inputs being investigated. If no matching experiment inputs are found, this may imply that certain sets of inputs have not been previously modeled and simulated in earlier studies. If distantly matching experiments are returned in a ranked list, these experiments may provide a basis for examining the effects that various input parameters for the experimental scenario have on the results of the experiment. When highly matching ranked results are returned, there may be an indication that experiments with this type of inputs have been studied before, with results already available. Thus, using a digital library to archive results may actually lead to a system for identifying viable areas of interesting experimentation.
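The ranked-list behavior described above can be sketched with a simple similarity measure: score each archived experiment by the fraction of query parameters it matches, then sort. A real ranker would weight parameters and handle numeric ranges; this overlap measure, and all names in the sketch, are illustrative assumptions rather than SimDL's actual query model.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative ranking sketch: archived experiments are scored against a
// query configuration by the fraction of query parameters they match,
// producing a ranked list from exact matches down to distant ones.
public class ConfigSearchSketch {
    record Scored(String experimentId, double score) {}

    static double similarity(Map<String, String> query,
                             Map<String, String> candidate) {
        if (query.isEmpty()) return 0.0;
        long matches = query.entrySet().stream()
                .filter(e -> e.getValue().equals(candidate.get(e.getKey())))
                .count();
        return (double) matches / query.size();
    }

    static List<Scored> rank(Map<String, String> query,
                             Map<String, Map<String, String>> experiments) {
        return experiments.entrySet().stream()
                .map(e -> new Scored(e.getKey(), similarity(query, e.getValue())))
                .sorted(Comparator.comparingDouble(Scored::score).reversed())
                .toList();
    }
}
```

A score of 1.0 signals an experiment already covering the requested inputs; uniformly low scores suggest an unexplored input region.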

SimDL initially supports a hierarchical XSD schema consisting of input attributes, elements, restrictions, and documentation. Simulation systems without a hierarchical structure to input parameters will simply have a flat schema consisting of mainly top-level elements. Hibernate's2 automation functionality allows database tables to be constructed to match a recently uploaded schema and maps records in each table to objects SimDL may handle. This automation produces the SimDL storage layer for a specific schema and allows multiple DL components to access a schema's related collections based on the schema itself without interaction from a system administrator or

2 http://www.hibernate.org


Table 5.1: SimDL service and broker classes.

DLBroker: connects to an assigned blackboard (test, development, production), reads blackboard entries sent to the 'DLBroker', processes a sent entry based on the requested DLServiceType, and returns the result to the blackboard.

DLBrokerTest: contains testing scripts for the broker and all of the DL services and classes.

DLCurate: executes and evaluates the set of rules defined for a curation activity and returns a listing of digital objects that are suited for that activity. Currently implemented actions include preserving, deleting, archiving, and migrating. Execution of the set of rules produces a collection of digital objects that are recommended for the attached curation activity.

DLCurateAction: is an enum representing: preserve, delete, archive, and migrate. These are the possible activities that may be guided by specific DLCurateRules (policies).

DLCurateBooleanPartialDecisionClause: builds and executes queries over metadata and encodes a restricted set of SQL and RDF operators (see DLCurateOperator). This is used to discover the set of digital objects that match specific metadata criteria. The criteria initially include in, between, >, <, <=, >=, and =. This allows more complex rules to be defined which combine the results from multiple partial decision clauses.

DLCurateCombinationType: is an enum representing: arithmetic (mean), geometric (mean), harmonic (mean), summation, and maxContributingClause. These represent the possible mechanisms for combining and ranking the results from multiple partial decision clauses.

DLCurateDORanker: takes a set of Boolean partial decision clauses, executes each clause, and returns combined results to fulfill the full clause by combining the multiple results as defined by the requested DLCurateCombinationType.

DLCurateOperator: is an enum representing: in, between, lessThan, greaterThan, equals, lessThanEquals, and greaterThanEquals. These are used to generate an SQL or RDF query statement pairing a specific metadata term with the query or clause inputs.

DLCurateRule: encapsulates a single rule composed of partial decision clauses, combination type, threshold, and curation action. Each digital object with a combined clause weight that exceeds the threshold is returned in a list suggesting the action be performed on that object.
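The operator-to-query mapping described for DLCurateBooleanPartialDecisionClause and DLCurateOperator can be sketched roughly as below. This is a hedged illustration with hypothetical class, enum, and method names, not the SimDL classes themselves; it only shows how an operator might be rendered as an SQL comparison over a metadata term.

```java
// Hypothetical sketch of rendering a curation operator as an SQL
// comparison over a metadata term; the actual SimDL clause builder also
// supports RDF queries and the remaining operators.
public class PartialClauseSketch {
    public enum Op { IN, BETWEEN, LESS_THAN, GREATER_THAN, EQUALS }

    public static String toSql(String term, Op op, String... args) {
        switch (op) {
            case IN:           return term + " IN (" + String.join(", ", args) + ")";
            case BETWEEN:      return term + " BETWEEN " + args[0] + " AND " + args[1];
            case LESS_THAN:    return term + " < " + args[0];
            case GREATER_THAN: return term + " > " + args[0];
            default:           return term + " = " + args[0];
        }
    }
}
```

A rule would then AND or combine several such clauses, as described for DLCurateRule.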


Table 5.2: SimDL service and broker classes (continued).

DLFilter: builds and executes an SQL query of arbitrary metadata terms and values. The returned data is a listing of digital objects that exactly match all of the query terms.

DLGetDLID: generates and returns a new unique digital library identifier, DLID.

DLGetMetadata: queries and returns the full set of available RDF metadata predicates and objects for a single request subject.

DLIncentivize: executes a requested report and returns formatted text produced by combining all of the report metrics' results.

DLIncentivizingMetric: maintains a list of possible metrics (SQL statements) that mine the RDF metadata table. These may be utilized in multiple reports.

DLIncentivizingReport: defines a single report as a set of specific DLIncentivizingMetrics.

DLIncentivizingReportType: is an enum representing: provider, graph, software, result, hpc, user, and system. These are the initial set of available reports.

DLLog: inserts a set of RDF metadata triples in a metadata record as a batch.

DLMemoization: utilizes the DLFilter and DLGetMetadata services to identify a set of digital objects that exactly match a metadata pattern and returns the matching Data Manager SIDs if known by the digital library.

DLProvRegister: allows the provenance sequence between two digital objects to be assigned.

DLProvResolve: retrieves the set of digital objects that maintain the queried provenance type (i.e., precedes or follows) with the queried digital object subject.

DLProvType: is an enum representing: precedes and follows. This allows the provenance for a workflow of digital objects to be assigned.

DLRegister: inserts a set of RDF metadata triples in a metadata record as a batch of RDF triple inserts.

DLDataItem: utilizes Java Generics to allow an unspecified data type to be stored. Each DLService implementation returns this object.

DLSearch: builds and executes an arbitrary weighted Boolean query over the metadata index. The results are a ranked list of digital objects with the similarity score of each object to the query using a modified Rogers and Tanimoto similarity score.

DLService: provides an interface that all DL services must implement. Specifically, each service must have an execute method with Generics arguments and return type.

DLServiceType: is an enum representing: DLLog, DLRegister, DLIncentivize, DLSearch, DLFilter, DLMemoization, DLCurate, DLCurateDORanker, DLProvRegister, DLProvResolve, DLGetMetadata, and DLGetDLID. This is used in the DLBroker to direct a service request to a service capable of fulfilling the specific request.
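For reference, the standard Rogers and Tanimoto coefficient underlying the DLSearch ranking can be sketched as below. The "modified" weighting SimDL applies is not specified here, so this hedged example shows only the base coefficient over binary query/document match vectors, with hypothetical class and method names.

```java
// Hypothetical sketch (not SimDL source): the unmodified Rogers and
// Tanimoto coefficient S = (a + d) / (a + d + 2(b + c)), where a counts
// attributes present in both vectors, d attributes absent from both, and
// b + c the mismatches.
public class RogersTanimotoSketch {
    public static double similarity(boolean[] query, boolean[] doc) {
        if (query.length != doc.length) {
            throw new IllegalArgumentException("vectors must be the same length");
        }
        int agree = 0, disagree = 0;          // agree = a + d, disagree = b + c
        for (int i = 0; i < query.length; i++) {
            if (query[i] == doc[i]) agree++; else disagree++;
        }
        return (double) agree / (agree + 2.0 * disagree);
    }
}
```

Mismatches are penalized twice, so the score falls off quickly as query and document metadata diverge.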


tool builder.
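The schema-driven automation described above can be sketched as follows. This is a hedged illustration, not Hibernate's actual mechanism or SimDL source: the class name SchemaTableMapper and the DDL layout are hypothetical, showing only how a flat schema's top-level elements might map to a storage-layer table with no administrator involvement.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: derive a storage table from a flat schema's
// top-level elements (element name -> SQL type). Hibernate performs a
// richer version of this mapping in SimDL; this only shows the idea.
public class SchemaTableMapper {
    public static String toCreateTable(String schemaName, Map<String, String> elements) {
        StringBuilder ddl = new StringBuilder(
            "CREATE TABLE " + schemaName + " (dlid VARCHAR(64) PRIMARY KEY");
        for (Map.Entry<String, String> e : elements.entrySet()) {
            ddl.append(", ").append(e.getKey()).append(" ").append(e.getValue());
        }
        return ddl.append(")").toString();
    }

    public static void main(String[] args) {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("disease", "VARCHAR(128)");
        cols.put("transmissibility", "DOUBLE");
        System.out.println(toCreateTable("experiment_input", cols));
    }
}
```

Uploading a new schema version would simply re-run this mapping to produce a fresh table.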

5.3 Service Organization

The services in SimDL are split into 'collection building' and 'information satisfaction' services. Collection building services generally build and organize collections of content. Information satisfaction services allow content to be accessed and, in SimDL, also support scientific activities. Fig. 5.5 displays potential science-supporting services beyond the minimal functionality found in SimDL. Within these categories, services are organized into workflows of low-level and high-level services. These terms are not an indication of complexity but of whether a particular service interacts with digital object and metadata collections (low-level services) or with collections and other services (high-level services). These distinctions lead to the directed dependencies between services shown in Fig. 5.6. These dependencies arise as high-level service implementations perform software calls to lower-level services.
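The low-level/high-level split can be illustrated with a memoization-style lookup built on an exact-match filter, mirroring how DLMemoization composes DLFilter. The names and structures below are hypothetical stand-ins, not SimDL source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the layering: the filter touches the metadata
// collection directly (low-level), while the memoized lookup is expressed
// as a call to the filter (high-level), never to the collection itself.
public class ServiceLayersSketch {
    // Low-level: scans a metadata index (handle -> metadata terms).
    static List<String> filter(Map<String, Map<String, String>> index,
                               Map<String, String> pattern) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : index.entrySet()) {
            if (e.getValue().entrySet().containsAll(pattern.entrySet())) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // High-level: delegates to the low-level service.
    public static List<String> memoizedLookup(Map<String, Map<String, String>> index,
                                              Map<String, String> pattern) {
        return filter(index, pattern);
    }
}
```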

Figure 5.5: Possible science supporting services.

5.4 Interoperability

A digital library tasked with managing content from an entire scientific method will contain multiple heterogeneous collections, each potentially considered a non-autonomous digital library. The digital library system enables the federation of collections and internal processes dealing with


Figure 5.6: SimDL service dependencies.

managed content. Interoperability between collections of related content is provided by leveraging the provenance mapping between sets of items derived from each experimentation stage.

5.4.1 Functionality, Services, Scenarios

SimDL acts as a services provider when integrated within a larger infrastructure. Services provided through the SimDL broker to user interfaces allow access to multiple collections and ease the effort required to manage users, support roles, and allow users to switch roles as their access allows. The automation services utilized by the interfaces in each of the case studies provide a browser with a consistent look and feel across user roles. The digital library's server API allows customized UIs, designed by tool builders, to take advantage of the underlying services. A simulation-launching component may submit simulation requests and input configurations to a backend staging space. Simulation brokers monitoring the staging space can process the request and launch the experiment on a set of optimized computational resources. Similarly, submitting analysis requests can be automated as a digital library service. Through the use of an analysis request staging space, analysis procedures and software can be invoked for a recently completed simulation's datasets. SimDL automates the process of creating, inserting into, or querying a database after processing the logic


contained by a selected schema.

Functionality wrapped in modules is reusable when there exist clear specifications of each service, definitions of underlying digital objects, and pre- and post-conditions, as suggested in [142]. Code porting, integrating existing components to serve specific DL functions, is assisted by formal specifications of a desired system. Thus, formally defined services in an existing SimDL instance are reusable in generating other instances. Data resulting from a simulation process is typically analyzed to produce knowledge in regard to a tested scenario. This experimental process provides a provenance chain resulting from the composition of defined functions. Several stages of the experimental process are automated in SimDL by the composition of simulation and analysis submission functions. An intuitive querying function over simulation parameters and user annotations provides access to the stream of digital objects generated by each of the composite functions. SimDL differs from other DLs in that functionality is included through integrating new internal components, not through connecting infrastructures as in D4Science [23].

5.4.2 Users

Multiple experts are required to conduct complex simulation-based research. Similarly, multiple user roles exist for simulation-based digital libraries, which support collaboration, participation, and privacy as defined in [75]. In SimDL, collaboration is assisted by capturing and exchanging information produced from complex tasks performed by users at different stages of the experimentation process. The collaboration requires multiple tools to answer research questions by translating a query or simulation request into successive units of work. Although the submission of simulation results to an analysis system may be automated, participation from users is required to launch new simulations, perform customized analyses, draw conclusions, and annotate streams of content. Users with similar and differing roles work on the same content, further existing work between and within streams of content, and switch between consuming digital objects from previous experimental stages and providing content for users in successive stages. Privacy is provided by restricting access to an institution's users and trusted external individuals or groups. Collections are partitioned to provide access to users of a particular role (e.g., a general public account might allow finding existing documents and publications but not proprietary simulation schemas). In addition to privacy concerns, practical considerations may restrict the number of users with a simulation-launching role to reduce the stress on computational and data storage resources. A digital library's interaction with its infrastructure affects the user interoperability it provides. Users under most roles interact with a digital library's content and are presented with the ability to interact with multiple schemas and versions of schemas across all available simulation applications, domains, and contexts within a single, generic interface. By selecting tabs in the UI, users with a particular role may transition seamlessly to another role (e.g., study designers switching to analysts when viewing the results from a launched simulation).

Tool builders are provided with reusable, formally defined components. The use of a schema automates the UI generation, database mapping, simulation launching, and collection management


processes. Human involvement consists of developing an XSD schema for a simulation system and starting the DL generation process.

For study designers, the digital library hides the complexity of launching and analyzing simulations. The designer may reuse and modify existing sub-configurations, submit simulation requests, make use of data management services, and interact with the simulation infrastructure.

Analysts are able to retrieve datasets and summaries of datasets along with the input conditions from which the results are derived. Automated tasks for analysts include the generation of plots and summary statistics, with the potential for extensions to perform standard analysis tasks useful for a specific institute and simulation system. Explorers are able to query and retrieve content across the entire provenance trail for findings without having to interact with each simulation system component. Users with this role may discover existing content or identify a lack of previous simulations for a queried segment of the simulation system's multi-dimensional input space.

5.4.3 Architecture

Rich component profiles are described in [142] as a method of allowing reuse of components when building digital library instances. A component profile, 5S descriptions of a component's related digital objects, and the context for the component's required functions allow existing components to be discovered and reused. Detailed component profiles would be useful in generating entire digital library instances similar to existing simulation-supporting digital libraries that use different schemas and backend simulation systems. Development efforts are reduced as components, composite functions, and entire digital libraries become reusable. In SimDL, automatically generated storage components are customizable to support a schema (e.g., utilizing RDF or RDBMS access methods by parsing a schema's XSD structure of elements and attributes). Content in the generated storage component becomes accessible to users by generating internal components that provide access to the repositories through a service API leveraged by the generic UI. The simulation-based class of DLs is most useful in providing rich services when integrated into an existing simulation process, data workflow, and infrastructure.

Authorization and Access

Since large-scale, data-intensive experimental efforts require sufficient resources to be continuing efforts, many organizations conducting such endeavors are often reluctant to share the data produced by their work. This may be a factor in determining the actual widespread adoption of a digital library storing data, research studies, experimental metadata, or higher-level documents. The collaborators allowed to access archived data may be decided by a standard organizational policy or user-level settings. At a minimum, SimDL is able to enforce strict access privileges of data or metadata to authorized users within a user group.


Chapter 6

Simulation-Specific Services

The digital library technologies developed in this effort have advanced the state of the art in scientific DL services. Historically, large-scale scientific content has proven difficult to manage in a scalable manner. The class of DL produced here provides novel service designs or implementations for simulation indexing, searching, metadata capturing, metadata managing, content reusing, curating, provenance tracking, and incentivizing. Formal definitions provide a basis for describing and developing DL systems, functionality, and instances. Previous publications have formally defined SimDL in terms of content, users, scenarios, and services [87, 117]. Example simulation-specific service implementations include metadata description set generation, automated metadata value extraction, metadata record building, archiving, curation, preservation, ontology building, organization, commenting, tagging, rating, recommending, interface support, and fingerprinting services. The produced formal definitions for SimDL define a DLMS with a broader scope than the implemented DLS requires, e.g., definitions for future work in commenting and rating scientific digital objects.

Fig. 6.1 details an expanded list of services that require novel simulation-specific implementations. Of these services, infrastructure connecting, contextual searching, curating, metadata value extraction, metadata record generating, indexing, management, ranking, weighted-scientific searching, memoizing, and incentivizing have been implemented and evaluated. This set of services covers the minimal set of prototype-required services and provides a diverse set of novel services not available in current DL systems. A subset of the high-level services can be directly implemented with the availability of our existing basic database-centric services. The links between the high-level, layered services and activities denote the hierarchy of service dependencies. Note that search is typically a basic service but requires a novel, advanced simulation-specific implementation. The following services have been implemented in the initial SimDL framework: DLLog, DLRegister, DLIncentivize, DLSearch, DLFilter, DLMemoization, DLCurate, DLCurateDORanker, DLProvRegister, DLProvResolve, DLGetMetadata, and DLGetDLID.


Figure 6.1: Organization of the content workflow, communication interface for content access and service exposure, low-level services, layered high-level services, and user communities.

6.1 Service Layers

The assortment of SimDL services produces a system capable of meeting the digital library requirements for maintaining a set of simulation-based experiments. At the collection-building level, SimDL allows users to submit queries, retrieve documents, contribute content or metadata, and modify content or metadata. The information management layer provides the ability to store and retrieve content related to the entire experimental process. This consists of a collection of archives containing various digital objects mapping to each stage of the experimental process. Between the collection-building and information management layers, a middleware layer provides the context-specific inference process whereby SimDL maps user requests to operations on collections of content. The collection-building system provides the storage and retrieval needs of the digital library's user base. Applying a metadata index-based approach to content management provides simple search over the extracted data. Data mining on the full set of raw data requires extensions into the data management component that manages the collection of raw digital objects. By separating the services into several layers, higher-level services may be rapidly generated which


utilize underlying services.

6.2 Novel Simulation-Specific Services

A key aspect of SimDL is the ability to interface with the simulation software to produce new content that has not been previously simulated. The current generation of commercial and open-source digital libraries does not place a strong emphasis on storing scientific data itself apart from publications. Outside of SimDL, no progress has been made at connecting digital libraries with research infrastructures to directly produce workflows of scientific digital objects. Another service not found in common digital library implementations is the ability to make use of domain knowledge when ranking the resultant set of documents related to a given query. As input parameters found in a domain schema are stored in the metadata for a given set of experiments, researchers may query the system to determine if a given type of experiment has been attempted before. Although storage for diverse content is available in digital library systems, e.g., Greenstone [157] and in [143], the sequential workflow of digital objects required in this context is atypical. Metadata and indexing schemes must provide the ability to store parameters for data, graphs, analyses, and reports for each set of experiments. Existing open source digital libraries lack the functionality necessary to provide these context-specific services.

Collaboration services consist of tasks supporting multiple researchers at various stages of the simulation workflow. Scientific collaboration is a composition of generic and simulation-specific services; see Fig. 6.2. Collaboration consists of tasks for annotating, locating, rating, recommending, reviewing, and submitting content. Locating includes processes for user acquisition of content through browsing, searching, and retrieving. Submitting includes processes for indexing, managing, and disseminating content within a DL. Collaboration reuses generic services for annotating, rating, recommending, and reviewing; locating invokes simulation-specific services for browsing, searching, and retrieving; and submitting invokes generic services for indexing and disseminating. Generic versions of these services for text-based documents are defined in [50]. SimDL currently has implementations for all of the collaboration services except for commenting, rating, and tagging. As all of the epidemiology content in a model-specific collection is produced by the same simulation model, ratings for individual documents are currently superfluous.

6.3 Infrastructure Communication

Service requests may be sent to the DL through a blackboard system. When SimDL is integrated in this type of infrastructure, messages or service requests are placed in the blackboard to be sent to SimDL. SimDL monitors the blackboard it is registered to and processes messages as they arrive. Results from SimDL are then posted back to the DL and the requesting infrastructural component is notified. The services may be utilized by including the DL binary within a project


Figure 6.2: 5S graph of collaboration and related services.

Table 6.1: Blackboard-based DLBroker service.

Input: reads 'BlackboardEntrys' from an assigned Blackboard. The entry must have a type set to "DL_Request". The entry specifies the DLServiceType (i.e., which service is requested) and provides the required service arguments.

Results: places result blackboard entries on the designated blackboard.

directly for better performance, or in infrastructures without a blackboard component. In either case, a DLDataItem object will be returned by either the Blackboard or the service invocation. Each DL service utilizes Java Generics for receiving service arguments. A DL service is invoked using the interface's execute method with the following prototype: public <T> DLDataItem execute(T args). In most cases, a HashMap of argument terms is sent as the argument, with specific keys required by the service along with values for each argument term. Table 6.1 details the utilization of the DLBroker. Fig. 6.3 details the SimDL service request message passing architecture. Request messages (i.e., entries) are placed in a common blackboard, received by the DLBroker, and processed in DLServices.
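A minimal sketch of this calling convention follows. ArgCountService is a hypothetical stand-in service, and DLDataItem is simplified here to an Object wrapper rather than SimDL's Generics-based class; only the interface shape matches the prototype quoted above.

```java
import java.util.HashMap;

// Simplified stand-ins for the calling convention; not SimDL source.
class DLDataItem {
    private final Object payload;               // SimDL uses Java Generics;
    DLDataItem(Object payload) { this.payload = payload; } // simplified here
    Object get() { return payload; }
}

interface DLService {
    <T> DLDataItem execute(T args);             // prototype from the text
}

// Hypothetical service: reports how many argument terms it received.
public class ArgCountService implements DLService {
    public <T> DLDataItem execute(T args) {
        HashMap<?, ?> map = (HashMap<?, ?>) args;
        return new DLDataItem(map.size());
    }

    public static void main(String[] argv) {
        HashMap<String, String> args = new HashMap<>();
        args.put("schema", "epidemic_v1");
        args.put("query", "baseCase");
        System.out.println(new ArgCountService().execute(args).get());
    }
}
```

A blackboard entry would carry the same HashMap as its service arguments, with the DLBroker dispatching to the matching DLService.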

In certain situations, performance overhead is reduced by avoiding message passing through a blackboard-type system; see Fig. 6.4. In several case studies, SimDL has been included within external source code in pre-compiled binary formats. In one example, a case study's user interface required quick response times for user interactions and thus precipitated a bypass of the messaging system through precompiled binaries.


Figure 6.3: Message passing in a blackboard JINI-space coordinated SimDL environment.

Figure 6.4: Direct invocation of SimDL services in a SimDL infrastructure.

6.4 Curate

In systems that support scientific content, research organizations are increasingly giving long-term management of collections serious consideration. However, semi-automated curation of large scientific repositories has yet to be solved and accepted in the scientific community. For this service, a new means of describing curation policies and semi-automating policy execution was needed. The long-term aspects of management include curating, archiving, and preserving. Curation is complex, consisting of deciding what content to maintain and under what conditions and for what duration it is to be kept. Archival is the process of preparing, representing, and storing content to be accessed later, with a long-term focus on storage media, format representation, and access tools. Archiving was one of the core functions of formal publication mentioned in [129] and an increasingly common method of distributing previous scholarly articles, albeit without underlying data. Preservation is the process of ensuring that the format, meaning, and accessibility


of content is not lost over time. [74] describes the need to ensure information integrity. This entails that, given a bitstream or structured content, processes are needed to manage the object, dereference the object through identifiers, track provenance information tracing its source, and maintain the context of ways in which it was created and can be used. Various operations may be performed on curated collections, i.e., reformatting, copying, converting, migrating, and deleting. Curation is currently a difficult task as humans are poor at being able to 'identify the valuable' and 'destroy the worthless.' In [74], archivists are pessimistic about the efficacy of the natural selection of digital information, as people generally do a poor job of keeping the context of digital objects but do keep outdated files without sufficient metadata. Evidential value metadata records the founding and substantive activities of an organization. Informational value metadata records significant people, digital objects, and events. Together, evidential and informational metadata may be utilized to provide DL administrators with the ability to cognizantly perform curation activities.

[69] suggests internal and external quality dimensions that can be extended to assist in curation activities. Internal quality dimensions of content may be used to evaluate the importance of a digital object (DO) to existing processes and services, its context, and its role in stakeholder satisfaction. External quality dimensions evaluate a DO's importance to patrons of a DL system, expectations of the system, and role in providing DL performance that exceeds expectations. Collections may be analyzed using these criteria to gauge the magnitude of DO value, percentage of utilization, percentage of overall utilization change, costs, resources consumed in production, effect of turnaround time, and anticipated future utilization. DOs may be evaluated to quantify the cost, benefit obtained, integrity, dependability, accuracy, currency, completeness, comprehensiveness, ease of use, and satisfaction of stakeholders of current DL efforts. The goal of these evaluations is to ensure that a DL maintains useful content and functionality to support current and anticipated purposes related to an organization's mission. This type of analysis is provided by the DLCurate service.

While librarians have a history of curation-related activities and operations, new curation services are needed in DLs. These curation services are specific to science, as simulation content is storage-intensive, well structured, and reproducible under certain conditions. As such, SimDL needed a service that allows curation activities to be defined in a set of rules, along with a simple grammar to encode these rules. The societies requiring these services span collection agents, managers, curators, administrators, policy designers, librarians, maintainers, and repurposers. The curation service is provided through a combination of components including archived documents, handles/DOIs, metadata records, catalogs, rules, grammars, and curation operators. The goals of an automated curation system are to: a) properly define the required information needed to make curation decisions, b) define appropriate metadata description sets, c) automatically collect curation-related metadata based on a proposed metadata extraction grammar, d) encode policies based on rules defined by a proposed curation grammar, and e) automatically apply curation policies to scientific collections. Later portions of this chapter present our framework and theoretical definitions for DL services that produce self-generating metadata collections, self-indexing scientific content through metadata extraction, and self-curating collections through policy automation. Note that self-indexing here refers to the inclusion of values into metadata records regarding specific content, as opposed to popular database indexing for efficient database functions, as demonstrated in [148].


Scientific content produced as output from HPC simulation systems may scale to infeasible storage requirements in many institutions. Large simulation results datasets may be summarized or reproduced in many cases. Content curation provides a means of establishing policies for content removal as well as arbitrary user-specified deletion. Simulation-based curation requires processes for identifying removable documents, removing documents, and either removing the document's index entries or defining a method of re-generating the document. As input, curation requires a set of document handles in a metadata index, {h_x, ..., h_y}; a filter or query for pre-defined operation collection identification, {q, C_i}; and a specification for the type of curation operator, co, e.g., archiving, deleting, migrating, and preserving. Specifications for documents that should remain indexed but be physically removed for storage reclamation additionally require input for a regeneration method consisting of a software executable, sw_i, and an input configuration, icfg. The output of curation is a suggestion for a modified collection, C_i ⇒ C_i′, and an altered index for the collection, IC_i ⇒ IC_i′. Pre-conditions for curation are ∀h_i ∈ {h_x, ..., h_y} : ∃do_i, C_i : do_i ∈ C_i ∧ h_i = do_i : C_i ∈ Coll. Post-conditions for curation are not strict, as it is possible for no DOs to match the triggering criteria. In addition, each curation operator will have an alternative set of post-conditions, e.g., deletion produces a reduced collection and migrating increases the number of items represented and stored. SimDL's default policy is to maintain all information and rely on infrastructure designers to define processes for result summarization and dataset deletion. See Table 6.2 for a description of the behavior of the curation service.

Table 6.2: Formal service definitions for curating (deletion of content operation).
User input: {hx, ..., hy}, q, co
Other input: Ci
Output: Ci′, ICi′
Pre-conditions: ∀hi ∈ {hx, ..., hy} : ∃doi, Ci : doi ∈ Ci ∧ hi ∈ doi : Ci ∈ Coll; q ⊆ ICi
Post-conditions: C′ ⊆ C : ICi′ ⊆ ICi : ∀doi ∈ C′, doi ∈ C

Curation Activities

Digital curation is a set of policies and practices developed for the maintenance and effective management of digital content over time. Curation elicits the need for coordinating preservation, archiving, and long-term access activities. Curation in scientific domains "requires a blend of skills and experience, including advanced scientific research and competence in database management systems, multiple operating systems and scripting languages" [72]. Curation is very important in the scientific simulation context because of the sheer volume of data generated. To run experiments, data needs to be processed and transformed. Also, without enough documentation about the data and its meaning, the data is rendered useless for future users. This process of transformation and subsequent documentation is laborious and time-consuming. In addition, there are few standards that deal with how to document these different processes. [72] describes various advances in the area of biological data curation and suggests developing a structured vocabulary of data collected from various sources. The Digital Curation Center describes a stepwise digital curation life cycle and recommends undertaking a planned effort to promote curation and preservation throughout the data lifecycle [83].

Archiving consists of organizing, preserving, and providing access to selected content. Due to the nature of simulation-based research, there is often a trade-off between saving files, programs, and other materials [71]. Some simulation-based content may not require archiving when data can be regenerated, given a software binary and an input configuration. This caveat found in simulation systems allows for lossless data deletion, albeit with greatly increased access latency due to the regeneration of files.

Preservation aims to ensure that content remains consistent over its lifecycle. In [55], data is classified into two categories, ephemeral and stable. The classification is based on the capability to regenerate data. Ephemeral data cannot be regenerated or recomputed and must be preserved if valued. In contrast, stable data can be recomputed from other existing data. The decision regarding preservation of stable data is based on other factors, such as economic and financial feasibility, storage limitations, etc. Metadata is generally categorized as ephemeral data. Simulation results and analyses typically are stable data. Policies for conversion (i.e., copying files and programs to each new system as new systems are introduced) and emulation (i.e., saving the data and the program as a bit stream, along with a description of the original machine architecture) [98] may be processed by curation rules. As discussed in [147], the task of preservation should assure the maintenance of the value (both syntactic and semantic) of the digital content at any point of time. The laws of library science are applicable in this regard [125]. [147] gives a detailed list of several different factors that need to be a focus of a comprehensive curation effort, along with essential curation strategies designed to foster desirable curation values.

Over long periods of time, research institutions lose the ability to simply keep and maintain all of their produced content. Reliance on individual researchers to properly maintain content is not sustainable, due to matriculation, hardware failures, improper duplication strategies, and lack of dissemination. Institute-specific curation policies are needed to reduce data management costs, disk utilization, wasted regeneration of existing content, and duplication of human and computational costs. Curation policies may be used to automate removal of seldom accessed content and maintain heavily accessed or high-value content to maximize the future potential to reuse datasets. Activity-based costing frameworks for digital curation are derived from the OAIS reference model [81]. The cost estimation is based on cost-critical activities identified by the functional breakdown provided by the OAIS model. Often, DB administrators have the impossible task of identifying data, produced by others, that may be safely deleted from storage. With a new class of self-curating collections, user and DBA acceptance is required to allow for curation policy automation. Rules are required to prevent accidental deletion of unrecoverable content. Due to changes in institutions and infrastructures, curation policy rules will need to be continually updated. Following is a definition of curation, a proposed curation grammar, and a representative set of example curation rules.

Formal Definition

Curation is a tuple defined as curation = (CDO,G,C), where:


• CDO is defined as a tuple denoting the set of digital objects to be considered for curation;

• G is a set of curation grammar rules, {g1, ...,gn}, applicable on the set of digital objects; and

• C is the output of the curation task indicating the sorted/modified collection of digital objects after executing the grammar rules, G, on the input digital object collection, CDO.

A metadata index, or record collection, may be defined as a tuple MDOI = (C,MD) where:

• C is a collection;

• MD is a set of metadata terms, {md1, ..., mdn};

• mdi is a tuple defined as mdi = (M,V );

• M is a set of metadata types, {m1, ..., mn}, where each mi is a scalar alphabetic value string indicating a metadata type (e.g., file size, file type, file extension, format, regeneration cost, replicability, percent diff, usage, and lastUpdated); and

• V is a tuple, (v1, ..., vn), or a vector of values v1 to vn, associated with each metadata type from M for each digital object in a collection.

A partial decision clause is a tuple pdc = (MDOI, W, T), where

• MDOI is a metadata index;

• W is a set of weights, {w1, ..., wn}, assigning a value, wi ∈ [0,1], to be associated with metadata type mi in M within MDOI; and

• T is a function of a metadata type, within the domain of M, that returns an acceptable set of values or a singular threshold value associated with a particular metadata type.

Partial decision clauses are specific to metadata terms and are used to build curation policy rules. These decision clauses are set at the collection level to enable weighting and comparison of metadata terms for ranking, similarity searching, and curation policies.

A Boolean importance value function is a tuple bivf = (MDOI, PDC, IV), where

• MDOI is a metadata index for a collection C;

• PDC is a set of partial decision clauses, {pdc0, ..., pdcn}, providing a weighting for each metadata term in M; and

• IV is a set of weighted values, {iv0, ..., ivn}, where ivi is the score of doi ∈ C in relation to a given set of partial decision clauses.

The Boolean importance value function may be utilized to rank a set of documents based on storage cost, regeneration cost, reproducibility, etc.
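As a concrete illustration, the following Python sketch scores digital objects against a set of partial decision clauses and ranks them by importance value. The record layout, weights, and threshold predicates are illustrative assumptions; SimDL's toolkit itself exposes Java interfaces.

```python
# Minimal sketch of partial decision clauses and the Boolean importance
# value function (names, weights, and thresholds are assumptions).

def importance_value(doc_values, clauses):
    """Score one digital object's metadata values against a set of partial
    decision clauses: each clause contributes its weight when the value
    for its metadata type satisfies the clause's threshold predicate."""
    score = 0.0
    for clause in clauses:
        m, w, accept = clause["type"], clause["weight"], clause["threshold"]
        if m in doc_values and accept(doc_values[m]):
            score += w
    return score

# A collection's metadata index: one value vector per digital object.
index = {
    "do1": {"file_size": 900, "usage": 2},
    "do2": {"file_size": 10,  "usage": 40},
}

clauses = [
    {"type": "file_size", "weight": 0.5, "threshold": lambda v: v > 500},
    {"type": "usage",     "weight": 0.5, "threshold": lambda v: v < 5},
]

# Rank digital objects by importance value (e.g., as deletion candidates).
ranked = sorted(index, key=lambda h: importance_value(index[h], clauses),
                reverse=True)
print(ranked)  # -> ['do1', 'do2']
```

The same scoring routine can back ranking by storage cost, regeneration cost, or reproducibility simply by swapping the clause set.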


Curation Grammar

The curation grammar was constructed based on an enumeration of potential curation scenarios. From the set of curation scenarios and the formalized definition of curation, an abstract set of vocabulary terms and a rule encoding structure were developed.

Example Grammar Vocabulary

• has(m): Returns true if a metadata structure contains a metadata type m.

• value(m): Returns the value of metadata type m. If m does not exist, returns null.

• contains(m, value_set): If value_set contains value(m), returns true; otherwise, returns false.

• threshold(m): Executes and returns the threshold value for a metadata type. The threshold could either be a single value or a set of values, depending upon how T is defined for m.

• delete: Marks object for deletion.

• preserve: Marks object for preservation.

• archive: Marks object for archiving on external storage.

• THRESHOLD: Globally defined threshold value, e.g., the minimum importance value required for preservation, to prevent disk space issues.

Example Grammar Rules

1. Delete content over a certain size: if value(file_size) > threshold(file_size) → delete

2. Delete unnecessary files by type: if contains(file_extension, {tmp, DS_Store}) → delete

3. Preserve a set of human generated files by extension: if value(file_extension) ∈ {set_of_file_extensions} → preserve

4. Archive older, unused content: if (value(usage) < threshold(usage)) && (value(lastUpdated) > threshold(lastUpdated)) → archive
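The four example rules can be encoded directly against the grammar vocabulary. The following Python sketch is illustrative only: the metadata record layout, the threshold values, and the preserved extension set are assumptions, and threshold(lastUpdated) is read here as days since the last update.

```python
# Illustrative encoding of the four example grammar rules.
# Assumptions: a metadata record is a dict of term -> value; thresholds
# are fixed demonstration values; threshold("lastUpdated") is days
# since the last update.

def has(md, m):
    return m in md

def value(md, m):
    return md.get(m)                      # missing term -> None (null)

def contains(md, m, value_set):
    return value(md, m) in value_set

def threshold(m):
    return {"file_size": 1_000_000, "usage": 10, "lastUpdated": 365}[m]

def apply_rules(md):
    """Return the curation action recommended for one metadata record."""
    if has(md, "file_size") and value(md, "file_size") > threshold("file_size"):
        return "delete"                                        # rule 1
    if contains(md, "file_extension", {"tmp", "DS_Store"}):
        return "delete"                                        # rule 2
    if contains(md, "file_extension", {"tex", "bib", "doc"}):  # hypothetical set
        return "preserve"                                      # rule 3
    if (has(md, "usage") and has(md, "lastUpdated")
            and value(md, "usage") < threshold("usage")
            and value(md, "lastUpdated") > threshold("lastUpdated")):
        return "archive"                                       # rule 4
    return None                                                # no rule fires

print(apply_rules({"file_size": 5_000_000}))                    # delete
print(apply_rules({"file_size": 10, "file_extension": "tex"}))  # preserve
```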

With the methodology for describing curation policies as provided by the curation grammar and rules, digital libraries such as SimDL may periodically apply the current set of curation policies on a collection, or apply them due to an external event (e.g., low disk space). Items deemed suitable for removal may be removed from a collection as well as from the metadata index. The curation policies and strategies of an institution are likely to change over time. Given this, a grammar for describing new policies and succinctly encoding curation rules is pertinent. After the curation service has processed a given collection, sets of DOs are recommended for a set of curation operations, i.e., deletion, archiving, migration, and preservation. These actions are not performed automatically, to prevent the accidental loss of data. Hence, a DL administrator must take the handles for the recommended actions and perform these actions manually. See Table 6.3 for documentation on the curate service.

Table 6.3: DLCurate service.
DLCurate — Description
Input: requires no arguments, as all known curation rules are executed.
Results: contains an ArrayList<HashMap<String,String>> data item consisting of four indexes. Each index in the list contains a HashMap<String,String> with keys for "curationType", "DLID", and "Score". This HashMap contains a list of the IDs for a set of digital objects. The value of the "curationType" key is a DLCurateAction enum value (i.e., preserve, delete, archive, or migrate). Thus, each HashMap contains the set of digital objects recommended for the corresponding curation action.

6.5 Digital Object Collection Ranking

The collection ranking service is a lower level service utilized by the curation service. This service applies a set of Boolean clauses to a collection and returns a ranked list of content based on the metrics found in the clauses. This service is repurposed in the incentivization service's reports to give contributors a metric on their data products. See Table 6.4 for more details on the behavior of the service.

6.6 Filter

Filtering is often required in lists displayed in user interfaces. In fact, it is the only service listed as a minimal requirement in all of SimDL's case studies. Filtering is conducted to return an unranked list of content handles that exactly match a search criterion. The distinction between filtering and search is that search returns ranked lists of documents partially matching a query. In simulation contexts, filtering may be used to restrict results to particular users, groups, user roles, and sub-documents. In this last case, sub-documents might refer to a particular subset of input configuration parameters or results from a particular study. The formal definition of filtering is included in Table 6.5. Here, q is a k-term query, C is a collection, ids are a set of DO identifiers, and the output DOs' metadata values exactly match the query.

See Table 6.6 for a description of the behavior of the filtering service.
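A minimal sketch of exact-match filtering, assuming a metadata index shaped as handle → {predicate: object}; the handles and terms are hypothetical.

```python
# Sketch of exact-match filtering over an RDF-style metadata index
# (layout is an assumption: handle -> {predicate: object}).

def dl_filter(query, index):
    """Return the unranked handles whose metadata exactly match every
    (predicate, object) pair in the query; partial matches are excluded,
    which is what distinguishes filtering from ranked search."""
    return [h for h, md in index.items()
            if all(md.get(p) == o for p, o in query.items())]

index = {
    "DLID:1": {"user": "alice", "role": "analyst", "study": "H1N1-42"},
    "DLID:2": {"user": "bob",   "role": "analyst", "study": "H1N1-42"},
    "DLID:3": {"user": "alice", "role": "admin",   "study": "H5N1-07"},
}

print(dl_filter({"user": "alice", "role": "analyst"}, index))  # -> ['DLID:1']
```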


Table 6.4: DLCurateDORanker service.
DLCurateDORanker — Description
Input: requires an ArrayList<HashMap<String,Object>> representing a list of Boolean partial decision clauses and a HashMap<String,String> with a "combinationType" key with an associated value for DLCurateCombinationType. Each partial clause in the list is encoded by a HashMap<String,Object>. This partial clause's HashMap must contain a "weight" key and associated value of type Double representing the importance of this clause. The partial clause's HashMap must also contain an "object" key and associated value of type DLCurateBooleanPartialDecisionClause. The combination type indicates the method used to combine the multiple ranked lists for each partial clause and is selected from the DLCurateCombinationType enum.
Results: contains a HashMap<String,Double> with keys containing unique DLIDs for digital objects. The related values are the importance value of each digital object based on the provided partial decision clauses.

Table 6.5: Formal service definitions for filtering.
User input: q0, ..., qk
Other input: Ci, ICi
Output: {id0, ..., idn}
Pre-conditions: k > 0 : ∀md ∈ q, md ∈ ICi
Post-conditions: ∀md ∈ q, qmd = idmd

6.7 Get DLID

The service to get a DL identifier is a lower level service utilized by all of the collection building services. It simply produces unique identifiers to be used as handles for DOs in specific stages of the simulation workflow. See Table 6.7 for the formal definition, where h is a unique DO handle and DBseq is a DBMS sequence generator, e.g., an Oracle sequence.

This service supports the automated collection building services within an infrastructure. The identifier is utilized in numerous external components in the case studies to retrieve specific metadata and to link records to raw data being stored in the metadata broker. See Table 6.8 for a brief description.
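The behavior can be sketched as a thin wrapper over a sequence generator; here an in-memory counter stands in for the DBMS sequence, DBseq, and the class and prefix names are hypothetical.

```python
# Sketch of the identifier service: unique handles drawn from a
# monotonically increasing sequence (an in-memory stand-in for DBseq).

import itertools

class DLIDService:
    def __init__(self, prefix="DLID"):
        self._seq = itertools.count(1)   # plays the role of DBseq
        self._prefix = prefix

    def get_dlid(self):
        """Return a handle guaranteed not to collide with earlier ones,
        since each call consumes the next sequence value."""
        return f"{self._prefix}:{next(self._seq)}"

ids = DLIDService()
a, b = ids.get_dlid(), ids.get_dlid()
print(a, b)  # DLID:1 DLID:2 -- always distinct
```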

Table 6.6: DLFilter service.
DLFilter — Description
Input: requires a query HashMap<String,String> with keys consisting of RDF metadata 'predicate' keys and values consisting of RDF metadata 'object' values.
Results: contains an ArrayList<String> where each string is a unique DLID 'subject' found in the RDF metadata.


Table 6.7: Formal service definitions for getting a DL identifier.
User service input: none
Other input: DBseq
Output: h
Pre-conditions: h ∉ C
Post-conditions: h ∈ DBseq

Table 6.8: DLGetDLID service.
DLGetDLID — Description
Input: requires no arguments.
Results: contains a HashMap<String,String> where the "DL_ID" key is associated with a newly generated unique DLID.

6.8 Get Metadata

The service to get the metadata for a particular digital object is a low level service. Given an RDF triple subject, the service returns the full listing of predicates and objects related to the subject. The formal definition is found in Table 6.9, where a query and collection are transformed into a set of metadata from the metadata index. It is utilized by the search, filtering, and incentivizing services.

Table 6.9: Formal service definitions for getting metadata.
User input: q
Other input: Ci, ICi
Output: {md0, ..., mdj}
Pre-conditions: ∃q, C
Post-conditions: q ∈ md : ∀mdj, mdj ∈ ICi

See Table 6.10 for a description of the service's behavior.

Table 6.10: DLGetMetadata service.
DLGetMetadata — Description
Input: requires a query HashMap<String,String> with a 'subject' key and associated value of the RDF triple 'subject' for which to get metadata.
Results: contains a HashMap<String,String> where each key is an RDF metadata 'predicate' and the associated value is the metadata 'object'.
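A minimal sketch of the lookup, assuming the metadata index is a list of RDF (subject, predicate, object) triples; the triple data is illustrative.

```python
# Sketch of metadata lookup over RDF triples: given a subject, collect
# every (predicate, object) pair (triple data below is illustrative).

triples = [
    ("DLID:7", "model",     "EpiFast"),
    ("DLID:7", "file_type", "input_configuration"),
    ("DLID:8", "model",     "EpiSimdemics"),
]

def get_metadata(subject, store):
    """Return {predicate: object} for all triples with this subject."""
    return {p: o for s, p, o in store if s == subject}

print(get_metadata("DLID:7", triples))
# -> {'model': 'EpiFast', 'file_type': 'input_configuration'}
```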

6.9 Incentivize

Community building is enhanced with services that encourage and reward humans and systems that contribute content, software, resources, activities, and efforts. As such, the high-level incentivization service generates reports along these lines. While considerable research has gone into growing online communities, such services do not exist to encourage scientists to contribute to and utilize the type of applications and infrastructures supported by SimDL. The service makes use of logs and tracks the usage of content, software, and services, and provides reports to producers and system administrators. The societies utilizing this service include collection agents, administrators, authors, publishers, and contributors. The activities supported through incentivizing include publicizing the usefulness of contributions, marketing resources, and advertising systems as a repository of entities. The components involved include collections, resources, activity logs, metrics, statistics, and reports. The formal signature of incentivizing is given in Table 6.11, where a query covers a specific collection C, set of reports R, and set of statistics or metrics M for each report. The data products are a (possibly null) bytestream encoding a textual report.

Table 6.11: Formal service definitions for incentivizing.
User service input: q
Other input: Ci, R, M
Output: str
Pre-conditions: ∃Ci, R, M : ∀Mm, Mm ∈ Rr
Post-conditions: R → str

The set of statistics provided for content producers and resource contributors in the CINET case study is provided in Table 6.13. These statistics are built from lower level logging of each service that touches a particular digital object and the context related to the access event. The service creates simple reports on the utilization activities and rates of various components. More complex reports execute multiple queries, e.g., showing that a contributor's usage has gone up over time, the kinds of people who are using contributed content, how high a rating a contribution received, and how a contributor compares to the top ranked person. The CINET case study includes seven report templates that include 45 statistics. The seven reports are targeted to providers, graphs, software, results, HPC, users, and system infrastructure. See Table 6.12 for a further description of the service.

Table 6.12: DLIncentivize service.
DLIncentivize — Description
Input: requires a query HashMap<String,String> with a "reportType" key and associated value of a type of report to generate, as selected from DLIncentivizingReportType.
Results: contains a HashMap<String,String> where the "result" key is associated with a single String containing the requested report.

6.10 Log and Register

Logging and registering are considered the same service in SimDL, with slight differences in semantics. Logging activities are restricted to the automatic recording of events that produce content, e.g., when a simulation configuration, contribution, experiment, or result is generated. Registering is the explicit semi-automatic or manual recording of DOs in the DL after they have been processed by another entity. Both of these activities leverage the same underlying collection building services, known as DLLog and DLRegister. Thus, the differences are only in the processes leading to the service call. The key requirement for these services is to allow the DL to monitor Simfrastructure research-based events. The formal definition is provided in Table 6.14. See Table 6.15 for a description of the service.

Table 6.13: Partial statistics in CINET incentivizing.
Provider: number of graphs provided; number of software applications provided; average uses per provided graph; average uses per provided software; cumulative uses total; breadth vs. depth ratio (number of studies / number of replications)
Graph: datestamp provided; studies used in; count of individual replications; usage distribution over time; bibliometrics (downloads/uses in the last x weeks, last x months); breadth vs. depth (used in many measures vs. thorough parameterization); percentage of measures used with this graph
Software: datestamp provided; total times used in studies; use distribution over time; number of products generated; which graphs it was paired with; percentage of graphs used with this measure; percentage extent of parameterization
Results: times reused; times 'memoized'; reuse distribution over time; bibliometrics (downloads in the last x weeks, last x months, etc.); generation/computation time; HPC cycles needed to generate the result; storage footprint; potential collaborator recommendations
HPC: jobs submitted
Systems: number of replicates; number of studies executed; percentage of studies executed on this platform; running time; data size in/out; memoization count, ratio, etc.
Users: track and provide feedback on end-users or personalization
System: job throughput rates, job throughput delay, expected completion time, wait

Table 6.14: Formal service definitions for logging.
User service input: DO, {md0, ..., mdm}
Other input: Ci, ICi
Output: none
Pre-conditions: ∃Ci, ICi
Post-conditions: {md0, ..., mdm} ∈ ICi′ : DO ∈ Ci′

6.11 Memoization

Conducting large simulation study designs requires large amounts of resources, e.g., HPC systems, cycles, and scratch space. In addition, studies often take days of human-intensive effort to design, with a delay of several days or weeks while the study is executed. The memoization service implements a dynamic programming approach to linking input configurations to results for a given simulation system. Thus, the design space of a simulation study or request that overlaps with existing content may leverage the DL's repository of existing results. This service provides researchers with shorter study turnaround times, immediate access to results, and discovery of similar existing studies. It also lessens consumption of HPC systems and frees cycles for novel computations. The service provides a means of storing and reusing simulation results while avoiding repeated deterministic calculations. Memoization is a high-level service relying on the collection building, getting metadata, metadata extraction, and filtering services. The formal definition is provided in Table 6.16, where a set of metadata terms is matched with the set of data manager IDs having identical metadata terms. See Table 6.17 for additional details. In the SimDL toolkit implementation, a list of identifiers is provided for matching raw digital objects found in the data manager storage mechanism. The calling service may access the raw data files by secure copying or through the blackboard system as appropriate.
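The core idea can be sketched as a cache keyed by a canonical serialization of the input configuration, so that identical deterministic runs resolve to existing result handles. The class, field names, and configuration contents below are hypothetical, not SimDL's API.

```python
# Sketch of the memoization idea: key results by a canonical form of the
# input configuration so identical deterministic runs are not recomputed.

import json

class Memoizer:
    def __init__(self):
        self._results = {}   # canonical config -> list of result handles

    def _key(self, icfg):
        # sort_keys makes the lookup insensitive to parameter ordering
        return json.dumps(icfg, sort_keys=True)

    def register(self, icfg, handle):
        """Record a result handle produced for this input configuration."""
        self._results.setdefault(self._key(icfg), []).append(handle)

    def lookup(self, icfg):
        """Return handles of existing results matching this configuration."""
        return self._results.get(self._key(icfg), [])

memo = Memoizer()
memo.register({"model": "EpiFast", "r0": 1.8, "days": 120}, "SID:301")
print(memo.lookup({"days": 120, "r0": 1.8, "model": "EpiFast"}))  # -> ['SID:301']
print(memo.lookup({"model": "EpiFast", "r0": 2.1, "days": 120}))  # -> []
```

A query that only partially overlaps an existing study would instead go through the filtering service over the metadata index, as described above.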

Table 6.15: DLLog and DLRegister service.
DLLog / DLRegister — Description
Input: requires a metadata HashMap<String,String> containing metadata term "predicate" keys and associated metadata value "object" values. The RDF "subject" value is not needed, as the DL automatically generates a new "subject" DLID and attaches it to each metadata record.
Results: contains a HashMap<String,String> where the "DLID" key is associated with the String representation of an integer containing the number of RDF metadata triples added in this logging activity. A non-negative result indicates successful execution.


Table 6.16: Formal service definitions for memoization.
User service input: DO, q, {md0, ..., mdi}
Other input: Ci
Output: {h0, ..., hj}
Pre-conditions: md, h ∈ Ci
Post-conditions: ∀md, hmd = qmd

Table 6.17: DLMemoization service.
DLMemoization — Description
Input: requires a query HashMap<String,String> with keys consisting of RDF metadata "predicate" keys and values consisting of RDF metadata "object" values.
Results: contains an ArrayList<String> consisting of data manager 'SID's.

6.12 Provenance Register and Resolve

Provenance tracking provides the ability to discover the source and generation process for a DO. The service utilizes a metadata index of provenance links between DO handles to capture the workflow of content. Simulation systems produce a stream of content at multiple stages, including model schemas, input configurations, results, analyses, documentation, annotations, and publications. Preserving provenance information for findings and results is necessary for simulation researchers. Provenance is maintained by defining each relationship between content and harvesting content collections for each stage of the experimentation process. Similar to indexing a document doi in an index ICi, provenance information can be encapsulated in an index structure where two sets of documents, {doi, ..., dok} and {doj, ..., dom}, are indexed in P{doi,...,dok},rel,{doj,...,dom}. Provenance streams in a collection may be defined by the tuple P = (h{doi,...,dok}, {reli, ..., relk}, h{doj,...,dom}), where

1. h{doi,...,dok} represents handles for a preceding set or type of digital objects from Ci, a collection in Coll;

2. rel is a provenance relation, where reli is a relation between doi and a subset of {doj, ..., dom};

3. h{doj,...,dom} represents handles for a successor set of digital objects from Cj, a collection in Coll; and

4. {doi, ..., dok} ∈ Ci; and {doj, ..., dom} ∈ Cj.

The formal definition is completed in Table 6.18, where two DOs are linked by a provenance relationship term. See Table 6.19 for more details on the provenance registration service. See Table 6.20 for more details on the provenance resolver service.
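The register and resolve operations can be sketched as the maintenance and traversal of an index of (subject, relation, object) links between handles; the class name, handles, and the use of a 'precedes' relation are illustrative.

```python
# Sketch of provenance registration and resolution as an index of
# (subject, relation, object) links between digital object handles.

class ProvenanceIndex:
    def __init__(self):
        self._links = []   # (do_i, rel, do_j) triples

    def register(self, do_i, rel, do_j):
        """Record that do_i stands in relation rel to do_j (e.g., 'precedes')."""
        if (do_i, rel, do_j) not in self._links:
            self._links.append((do_i, rel, do_j))

    def resolve(self, do_i, rel):
        """Return all successors of do_i under relation rel."""
        return [d for s, r, d in self._links if s == do_i and r == rel]

# A two-step workflow: configuration -> result -> analysis.
prov = ProvenanceIndex()
prov.register("config:12", "precedes", "result:98")
prov.register("result:98", "precedes", "analysis:4")
print(prov.resolve("config:12", "precedes"))   # -> ['result:98']
```

Chaining resolve calls walks the provenance trail from a finding back to the configuration that produced it, or forward again.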


Table 6.18: Formal service definitions for provenance tracking.

Registering:
User service input: doi, doj, rel
Other input: Ci
Output: none
Pre-conditions: doi, doj ∈ C
Post-conditions: (doi, rel, doj) ∈ C′

Resolving:
User service input: doi
Other input: Ci
Output: ∪ j=0..n {doi, rel, doj}
Pre-conditions: doi ∈ Ci
Post-conditions: {doi, rel, doj} ∈ Ci

Table 6.19: DLProvRegister service.
DLProvRegister — Description
Input: requires a query HashMap<String,String> with keys "subject", "dlprovtype", and "object". The values associated with "subject" and "object" must be valid assigned DLIDs. The value associated with "dlprovtype" is selected from DLProvType and should be read as a left to right statement (e.g., 'subject DLID' 'precedes' 'object DLID').
Results: contains an Integer describing the number of new provenance-related metadata triples that were registered (e.g., 0 or 1).

Table 6.20: DLProvResolve service.
DLProvResolve — Description
Input: requires a query HashMap<String,String> with keys 'subject' and 'dlprovtype'. The value associated with 'subject' must be a valid assigned DLID. The value associated with 'dlprovtype' is selected from DLProvType and should be read as a left to right statement (e.g., find results X, where 'subject DLID' 'precedes' X).
Results: contains an ArrayList<String> containing RDF metadata triple 'objects' that have the requested DLProvType relationship 'predicate' with the requested 'subject'.

6.13 Search in Scientific Collections

Search over scientific collections requires novel techniques and algorithms to process highly numeric scientific content with little to no free text. Over the course of several decades of information retrieval research, several thousand weighting schemes have been introduced and tested; see [26]. Salton experimented with term weights in the 1960s, though most active schemes have been produced from the 1970s onward [131]. Salton, Allen, and Buckley summarized the 20 years of weighting work conducted within the SMART system [132]. In this work, over 1,800 term weighting factors were tested experimentally. A detailed overview of similarity functions (e.g., cosine) is given in [155]. Historical work in this area has proposed approaches to improving search, e.g., changing the document collection through document transformation, reformulation of the query, modifying term weights, query splitting, query expansion, and (Rocchio and pseudo) relevance feedback. This treatment of discovery services is split into searching, ranking, browsing, and dissemination.

At the core of the discovery services is an RDF-triplestore metadata index. With this index, activities such as filtering, providing access, archiving, requesting, searching, and selecting are supported. The index is utilized to discover a set of handles or DOIs in a catalog that are likely to be relevant to a request. Ranking is a complex process, typically relying on probabilistic information retrieval, degree of similarity, metric spaces, and indexing services. With known metadata indexes, we can evaluate the accuracy, pertinence, relevance, precision, and recall scores of search implementations.

Browsing scientific content is similar to browsing textual documents, with the added assumption that documents will be displayed in a faceted design based on model and workflow stage. SimDL has implemented faceted browsing for a clustered document space based on the simulation model source and the stage of the workflow producing each type of content, as described by existing ontologies. Browsing requires as input an initial simulation facet, anchorf, and potential targets, Hyptxtj. Browsing produces as output a set of digital objects, {doi : i ∈ I}. The pre-condition for browsing is that the anchor must exist in the network of hyperlinks, anchor ∈ Hyptxtj. The post-condition restricts the resulting digital objects to existing documents in the collection valid for Hyptxtj, ∃C ∈ Coll : {doi : i ∈ I} ⊆ C.

Existing search functions are well suited for full-text documents and textual metadata. Simulation content requires search over numerical metadata and summarized content. The metadata for typical SimDL content consists of numerical values for metadata terms. As an example, the metadata schema for a collection of input files to a simulation model will largely consist of selected input terms. Traditional similarity metrics break down in this situation, as the semantics and similarity of multiple metadata values are non-trivial to determine. Search implementations relying on text-based metrics, such as term frequencies, thesauri, dictionaries, stemming, and term co-occurrence, break down in scientific digital libraries. However, k-length queries may be segmented into k Boolean clauses for numerical metadata terms. In this type of querying system, implemented for SimDL, a similarity score between a document and a query is provided by a weighted summation over the k clauses, as presented in [88]. The formalism for scientific search is identical to that of full-text search, as a query (q), collection (Ci), and index (ICi) are required to produce a set of weighted results, {(doi, wqk) : qk ∈ q}. A detailed formal definition of the technical aspects used to implement a search service is required to differentiate between full-text and numerical-based search mechanisms.

Jonathan P. Leidig Chapter 6. Simulation-Specific Services 86

In ranking and searching of configurations, datasets, analyses, or documents in experiment-supporting DLs, it is useful to rank documents for queries based on annotations or selected input and analysis parameters. Query processing in a high-dimensional space might rank documents based on how each document's icfg parameter values match a Boolean or weighted query. While this ranking scheme is sufficient to provide a generic search mechanism, it is intended as a starting point for interested developers to customize search functions based on context-specific factors accompanying a context schema. Through semantic ties across the multiple collections of experiment-based content, it is possible to retrieve the stream of documents that constitute a provenance trail of research findings. The search functions used in SimDL are a high-level service relying on previous collection building activities and available metadata indexes. The query agent, i.e., a human or an infrastructure component, specifies a data type, query model, and k query terms. The internal algorithm uses Boolean weight functions to rank scientific metadata terms. The exact form

is

    similarity(Q, D_j) = (∑_{i=0}^{k} p_i · v_{i,j}) / (∑_{i=0}^{k} p_i)

where

• p_i = the user-defined parameter weight, default 1/k, and

• v_{i,j} = the Boolean match of term i in the RDF metadata for document j.

This implementation scales linearly with the size of the metadata index and completes in 0.14 seconds on a small collection of several hundred items. The current implementation has the limitation that k separate queries are performed and then combined, which will need to be optimized in distributed implementations. The strength of the algorithm is that it works well with scientific metadata schemas and highly numerical content. The formal definition is completed in Table 6.21. See Table 6.22 for a description of the search service.
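The ranking formula can be written out directly. In this Python sketch, the clause tuples and metadata term names are hypothetical examples for illustration, not actual SimDL schema terms or SimDL's implementation.

```python
def similarity(query, doc_metadata):
    """Weighted Boolean similarity between a k-clause query and one document.

    query:        list of (term, value, weight) clauses
    doc_metadata: dict mapping metadata terms to values for one document
    Implements similarity(Q, D_j) = sum(p_i * v_ij) / sum(p_i), where v_ij is
    the Boolean match of clause i against the document's RDF metadata.
    """
    matched = sum(w for (term, value, w) in query
                  if doc_metadata.get(term) == value)
    total = sum(w for (_, _, w) in query)
    return matched / total

# Hypothetical 3-clause query; two of the three clauses match this document.
query = [("graphProvidedBy", "jleidig",   2.0),
         ("diseaseModel",    "influenza", 3.0),
         ("region",          "Alabama",   4.0)]
doc = {"graphProvidedBy": "jleidig",
       "diseaseModel":    "influenza",
       "region":          "Montgomery"}
print(round(similarity(query, doc), 2))  # → 0.56  (matched 2+3 out of 9)
```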

Table 6.21: Formal service definitions for searching.

User service input: q_0, ..., q_k, w_0, ..., w_k
Other input: C_i
Output: {(h_0, w_0), ..., (h_n, w_n)}
Pre-conditions: ∀q_i, q_i ∈ I_{C_i}
Post-conditions: ∀h_i, h_i ∈ C_i; ∀w, 0.0 ≤ w ≤ 1.0

Dissemination of scientific content consists of exposing content and providing the ability to retrieve specific content. Exposing is a composition of the general services of acquiring, cataloging, indexing (to support search functions), and classifying (to support faceted browse functions). Retrieval consists of requesting a document, ∃C ∈ Coll : do_i ∈ C; identifying handles for the requested document, h_i = do_i; and provision of the document as output, {do_i}. The pre- and post-conditions for retrieval are ∃do_i, C : do_i ∈ C ∧ h_i = do_i and do_i = h_i, respectively. Exposing and retrieval are fully supported in our implementation of SimDL.
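The retrieval formalism (request a handle, resolve it against the collections, return the object) can be sketched as follows. The handle-to-object dictionaries here are a hypothetical stand-in for SimDL's collection storage, used only to make the pre- and post-conditions concrete.

```python
def retrieve(handle, collections):
    """Resolve `handle` to its digital object across a list of collections.

    Pre-condition:  some collection holds an object under this handle
                    (∃ do_i, C : do_i ∈ C ∧ h_i = do_i).
    Post-condition: the returned object is exactly the one the handle names.
    """
    for coll in collections:
        if handle in coll:
            return {coll[handle]}     # output: {do_i}
    raise KeyError(f"no collection holds handle {handle}")

# Toy collections mapping handles to digital-object identifiers.
colls = [{"h1": "do1"}, {"h2": "do2"}]
print(retrieve("h2", colls))          # → {'do2'}
```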

The use of schemas by SimDL allows digital library services to make use of contextual information within a domain. Discovery is improved as DLs understand the environment in which queries are performed. This understanding is accomplished by restricting the search space to a context defined by an input schema and accompanied by additional domain information (e.g., the how, what, where, and why of anticipated queries). Similarity is based on distance in a high-dimensional space that includes input parameters, annotations, and simulation-related metadata. By restricting a DL's focus to content within a context, similarity ranking may be intelligently assisted by domain experts. Contextual ranking becomes important as content is searched from large systems and collections. The institutes involved in SimDL's development had previously limited metadata to basic information describing simulation runs and a brief tag entered by study designers or analysts without advanced experience in computing or data management. Pairing user annotations with SimDL's schema-based contextual information results in improved querying and discovery in comparison to previous browsing functions. By automating the RDF-based storage design to match a contextual schema, the storage structure mirrors the semantics of a context and allows processes with access to the schema and storage system (e.g., data mining algorithms) to intelligently perform operations on existing data. Though requiring additional effort from a domain and simulation system expert, DL component and service optimization may occur by taking advantage of contextual information to alter the weights in a Boolean query.

Table 6.22: DLSearch service.

Input: Requires a complex structure to represent the weighted query, ArrayList<HashMap<String, String>>. Each item in the ArrayList represents a single weighted search clause (e.g., a four-term query might have: metadata predicate "graphProvidedBy", metadata object "jleidig", and weight "0.25"). Each clause is encoded in a HashMap with the key "term" associated with the RDF predicate term identifier value (graphProvidedBy), the key "value" associated with the RDF object value ("jleidig"), and the key "weight" associated with a string representing a double value for the clause's importance. The weights are combined by a weighted arithmetic mean. Thus, a straightforward weight assignment for a k-clause query might use weight values between 0.0 and 1.0 that sum to 1.0. Alternatively, a 3-clause query with weights 2.0, 3.0, and 4.0 would have a total weight of 9.0 and possible similarity scores of 0.0, 0.22, 0.33, 0.44, 0.56, 0.67, 0.78, and 1.0.

Results: Contains a HashMap<String, Double> where each key is a 'DLID' for a digital object and the associated value is a numerical value representing the similarity of the digital object to the query.
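The DLSearch input encoding and weight combination can be illustrated with a Python stand-in for the ArrayList<HashMap<String, String>> structure described in Table 6.22. The clause terms and values below are invented; the snippet enumerates every achievable similarity score for the 3-clause example with weights 2.0, 3.0, and 4.0.

```python
from itertools import combinations

# Python stand-in for the ArrayList<HashMap<String, String>> query structure:
# a list of clause maps, each with "term", "value", and "weight" keys.
query = [
    {"term": "graphProvidedBy", "value": "jleidig", "weight": "2.0"},
    {"term": "diseaseModel",    "value": "malaria", "weight": "3.0"},
    {"term": "interventionDay", "value": "30",      "weight": "4.0"},
]

weights = [float(c["weight"]) for c in query]
total = sum(weights)                  # 9.0 for this 3-clause example

# Each achievable score is (sum of the matched clauses' weights) / total,
# so enumerating all clause subsets yields every possible similarity value.
scores = sorted({round(sum(sub) / total, 2)
                 for r in range(len(weights) + 1)
                 for sub in combinations(weights, r)})
print(scores)  # → [0.0, 0.22, 0.33, 0.44, 0.56, 0.67, 0.78, 1.0]
```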

6.14 Supporting Metadata Functionality

Supporting metadata using this approach involves processes for generating metadata description sets (schemas), extracting metadata values, forming metadata records, and indexing metadata records. In simulation-based science, hierarchical multi-resolution access to data is required and calls for careful metadata planning. "Data analysis must ultimately be done at the highest available spatial and temporal resolution, yet there is far too much data for all of it to be based at this level. Therefore, the analysis process often requires a number of intermediate stages" [113]. Extracted metadata can be considered on a plane one level above raw data. An RDF metadata graph can be indexed in simple DL storage and linked to data stores of DOs. The overall approach is to store a raw level of data, a level 2 description with details on the DO's generation process, a level 3 data approximation with extracted metadata, and finally a level with human-generated tags.


Digital libraries universally require processes to index metadata for supported collections. The performance of DL services is greatly influenced by the system-wide method of content indexing. Individual services typically cannot take advantage of optimally defined metadata indexes. What is needed is a framework for indexing content in graphs, defining the metadata required by services, and connecting components to interact with SimDL. Here, I present a graph-based index and services framework for digital libraries. The design uses graphs and formal definitions to support services at the typical and metamodel levels for powerful, general capabilities. From this, I develop a metadata model in which formal terms are defined within the 5S framework.

We previously described a framework that produced interoperable metadata description sets and harmonized model vocabularies, presented a software implementation of the framework, and demonstrated the performance of the software prototype on an existing digital library. This framework was presented in [95]. The publication contains a complete discussion of the metadata description set generation framework, description sets and records, description set generation software, harmonization, and automated metadata extraction. The discussion presented here provides a treatment of building description sets, extracting metadata from content, building metadata records, and indexing the records.

6.14.1 Generating Metadata Description Sets

Metadata description sets, also called metadata schemas, provide a structure for representing content through metadata terms and values as defined by the Dublin Core community [153]. The task of developing standards-compliant metadata description sets needs to be automated for scientific researchers, who typically lack expertise in information science. This may be provided through semi-automated generation software and automated extraction tools. The approach used here is similar to automated relational index suggestion tools found in the Microsoft Index Tuning Wizard, IBM's DB2 Advisor, and Oracle's Tuning Pack. Previously developed metadata description set generation software [95] takes several representative examples of simulation input and output files; recommends terms for potential inclusion in a metadata schema; allows user selection of the important terms; adds Dublin Core terms, infrastructure terms, and formal terms; and generates a metadata schema. Fig. 6.5 displays a representation of the organization of simulation-based metadata terms. See [95] for a discussion of methods and software for generating description sets.

6.14.2 Extracting Metadata Values

Figure 6.5: Metadata description set term organizational structure.

Content generated by the execution of simulation models must be automatically processed due to the volume of digital objects produced. Fortunately, metadata extraction need not require human-intensive efforts, as ontologies describing a domain and type of object may be used to build a metadata description set for each type of content. Pairing the description set with a metadata extraction script supports automated, run-time metadata record harvesting. Metadata extraction for a specific type of document requires an ontology, ont; a metadata description set, MD_s; an extraction script, ext_s; and a document harvested by a content broker, do_i. The output of the extraction process is the inclusion of do_i in the collection index, I_{do_i}. The pre-conditions are an existing collection, C_i ∈ Coll, and a script ext_s to extract content from do_i into I_{do_i} as described in MD_s. The post-condition is a modified collection index, I ⇒ I′, where I_{do_i} ∈ I′. Metadata extraction in SimDL is handled by extraction scripts paired with a model ontology and is provided by simulation model developers. It is assumed that all input files or output files from a specific version of a simulation model may be processed by the same extraction script.

Automated metadata extraction remains an open problem in information retrieval and management. Minimally, the goal of automated metadata extraction is to process files of a given type, determine metadata values for specific metadata terms, generate a metadata record with a given structure, and index the metadata record within a metadata collection for the given type. Even without the semi-automated development of quality metadata description sets through generation software, it is possible to implement automated metadata extraction for any static description set. The difficulty in metadata extraction lies in producing a well-described extraction process that remains faithful to the current description sets' structure, extraction procedures, and record formatting.

An additional DL component, consisting of a grammar and a software implementation, is needed to form an automated metadata extraction process. Users will need a language (grammar) for defining how to automatically extract (data mine) actual values from an individual digital object for all of the metadata terms described by a given metadata schema, and to generate a complete metadata record for the digital object. A software implementation takes as input: a) a metadata schema for a specific file type, b) an automated extraction specification written in the extraction grammar, and c) a digital object of the file type; it then produces a metadata record for the digital object.

Formal Definition: Formal definitions of metadata extraction and curation allow for precise understanding, implementation, and use within the automated curation framework. The following definitions build on previously specified formal 5S definitions [44].

Metadata extraction is a tuple defined as ME = ({DO}, M, G, Mdr) where

• {DO} is a set of input files, {do_1, ..., do_n};

• M is a set of metadata terms, {m_1, ..., m_k}, requiring metadata extraction;

• G is a set of grammar rules, {g_1, ..., g_r}, for metadata extraction into a specific record structure; and

• Mdr is a set of metadata vectors, {v_1, ..., v_n}, structured in records, as returned from the metadata extraction tool.

MDO = {mdo_1, ..., mdo_n} where

• mdo_i = {K, T};

• K is a set of t metadata types, {k_1, ..., k_t}, whose values need to be extracted from the file;

• T is a set of file types, where a file type is a combination of file extension and file structure, {e, s};

• e = the file extension; and

• s = the file structure, e.g., row-based or column-based.

Metadata Extraction Grammar

A metadata extraction grammar is needed to explicitly describe the process of mining metadata values from a structured file. The grammar described here is based on the standard types of structured files used as inputs and outputs to simulation systems, e.g., TXT, CSV, RDF, and XML. These files are based on a single simulation software version and are structured in row or column orientation. In some cases, metadata values appear next to term identifiers, while other value determination may require processing (e.g., column summation). Processing XML files requires a different set of operations, as the structure of an XML file may be leveraged to access values. The metadata extraction grammar builds upon the following formal vocabulary for terminal symbols. Note that the grammar production rules merit a separate, lengthy discussion. This grammar was constructed based on a variety of structured file exemplars used at the Network Dynamics and Simulation Science Laboratory at Virginia Tech.

Grammar Vocabulary

• delimiter: The delimiter between tokens in a file

• extract: Extracts and returns the token at the current location in the file

• goto_row(r): Move pointer to beginning of row r in the file

• goto_column(c): Move pointer to beginning of column c in the file, with columns separated by delimiter

• goto_next_token(t): Move pointer to next token t in file

• get_value: Extracts and returns the associated value for current token

• sum_column(c): Extracts and sums up all the values in column c and returns the result

• next_token_in_row: Advances pointer to next token in row (if it exists)

• v_ki: Metadata value for term k_i

Note that this partial vocabulary is required to describe the provided example grammar rules; an extended version provides additional functionality, e.g., average_column(c, row_i, row_j).

Example Grammar Rules: With this grammar, a set of metadata extraction rules may be produced to describe the methodology for mining values from a specific type of simulation-related file, as specified in a single metadata description set. The set of rules may be applied to the set of files F to generate metadata records for these digital objects. The following are two representative rules as encoded in the grammar. The first example processes a structured file, e.g., a simulation input file, that contains pairs of input terms and input values co-located on the same line. The second example processes a column-oriented file, e.g., a simulation output file, that contains data points to be totaled per column. Note that single-value selection rules, summation rules, etc., may be combined on a term-specific basis, i.e., each k_i is described by a separate rule.

• IF e = 'txt' and t = 'row-based' (a uniform text file of row-based {label, value} pairs) → delimiter = ','; for each row x,

  – goto_row(x)
  – if extract ∈ K, next_token_in_row; v_ki = extract;

• IF e = 'txt' and t = 'column-based' (a uniform text file of column-based labels for summation across all rows) → delimiter = 'TAB'; for each column x,

  – goto_column(x)
  – if extract ∈ K, v_ki = sum_column(x);
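The semantics of these two rules can be mirrored in executable form. This Python sketch is an illustration of what the grammar describes, not SimDL's actual extraction-script interpreter; the function names and the file contents are invented for the example.

```python
# Illustrative executables for the two example rules (not SimDL's interpreter).

def extract_row_based(text, wanted_terms, delimiter=","):
    """Rule 1: a row-based text file of {label, value} pairs on each line."""
    record = {}
    for row in text.splitlines():
        tokens = [t.strip() for t in row.split(delimiter)]
        if len(tokens) >= 2 and tokens[0] in wanted_terms:  # if extract ∈ K
            record[tokens[0]] = tokens[1]                   # v_ki = extract
    return record

def extract_column_sums(text, wanted_terms, delimiter="\t"):
    """Rule 2: a column-based text file; sum each wanted column over all rows."""
    rows = [line.split(delimiter) for line in text.splitlines()]
    header, data = rows[0], rows[1:]
    record = {}
    for c, label in enumerate(header):
        if label in wanted_terms:                           # if extract ∈ K
            record[label] = sum(float(r[c]) for r in data)  # v_ki = sum_column(c)
    return record

print(extract_row_based("model, EpiSimdemics\nseed, 42\nnotes, n/a",
                        {"model", "seed"}))
# → {'model': 'EpiSimdemics', 'seed': '42'}
print(extract_column_sums("day\tinfected\n1\t10\n2\t15", {"infected"}))
# → {'infected': 25.0}
```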


6.14.3 Forming Metadata Records

Metadata records are generated by leveraging the automated metadata extraction process to produce a set of descriptive metadata for a DO. The metadata is then added as predicates and objects to the DO's subject identifier in the RDF metadata index. The completeness of records, and their conformance to a description set, depends entirely on the automated metadata extraction service's ability to discover and extract metadata values from a DO. As such, the conformance of metadata records is directly related to the current metadata extraction's mining protocols, i.e., what terms and term values to search for within a DO. As the metadata record building process is automated, collection completeness is guaranteed for all DOs processed by Simfrastructure. SimDL uses the description set generation and harmonization software and is currently in use to manage epidemiology simulation models and related content. Metadata description sets for each of the three malaria and influenza models were generated; see Fig. 6.6 for an example. A metadata broker was developed to harvest input, output, and result term values, as well as to create metadata records each time the simulation model software is executed. The model-specific metadata schemas and records are used in all index, browse, search, discovery, dissemination, verification, and preservation services of SimDL. The metadata records are produced in RDF form to allow queries of specific records. Metadata produced through the framework provides description set interoperability between research groups, simulation systems, multiple models, and stages of the simulation workflow.

6.14.4 Indexing Metadata Records

An RDF triplestore is a database for storing and retrieving metadata records as triples in a graph. Each triple contains a subject, an object, and a predicate, or relationship, between the subject and object. RDF triplestores are composed of a graph of individual RDF triples, i.e., (subject, predicate, object). Network graphs provide a means of logically managing metadata for a collection. Overlay networks provide a means of indexing connections between digital objects and related metadata values. Service-specific overlay networks can be developed in a digital library through motif-identification algorithms. Motifs in graphs are small networks, or subgraphs, with a defined structure. In SimDL, the pairing of an individual input configuration with the related simulation results produced from that configuration forms a motif. Motif identification is useful in enumerating the building blocks of subgraphs and the functional properties of each type of motif. Multiple approaches exist for motif identification, including color coding, back-tracking, and dynamic programming. Parallel subgraph identification algorithms, e.g., PARSE [161], are required for large graphs such as the metadata indexes of large collections. Initially, the indexing and searching services were designed with Oracle RDF processing implementations. While distributed, scalable motif-identification methods are available in our case studies, the SimDL toolkit provides a non-parallel RDF search service for small, medium, and large collections. Our aim is to produce a graph-based mechanism for indexing all types of content, as graphs provide a generic method of organizing content. The power of formalized search can be demonstrated in two contrasting domains.
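A minimal in-memory stand-in for the triple index and the configuration/result motif might look as follows. The predicate and identifier names (`sim:producedBy`, `cfg:1`, etc.) are illustrative assumptions, not a published vocabulary or SimDL's storage layer.

```python
# Toy triple index: a list of (subject, predicate, object) tuples, with a
# motif query pairing one input configuration with the results it produced.

triples = [
    ("cfg:1", "rdf:type",       "sim:InputConfiguration"),
    ("res:1", "rdf:type",       "sim:Result"),
    ("res:2", "rdf:type",       "sim:Result"),
    ("res:1", "sim:producedBy", "cfg:1"),
    ("res:2", "sim:producedBy", "cfg:1"),
]

def objects(subject, predicate):
    """All objects reachable from `subject` via `predicate`."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}

def motif_config_results(cfg):
    """The configuration/result motif: the results linked to one config node."""
    return {s for (s, p, o) in triples
            if p == "sim:producedBy" and o == cfg}

print(motif_config_results("cfg:1"))  # → {'res:1', 'res:2'}
```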

Figure 6.6: Metadata description set for input configuration files to the EpiSimdemics model. This XML representation is simply an alternative, concise translation from the RDF triplestore used to store the metadata records. Each of these lines of XML is stored in a triple, e.g., ('id:0', 'dc:creator', 'user:leidig').

Metadata Structure and 5S Graph Representation: Content may be described by Dublin Core [153], domain-specific, infrastructural, and 5S terms, and can be represented in a graph structure. The metadata schemas for SimDL systems include 5S and Dublin Core terms. Simple 5S terms define basic concepts. SimDL includes formally defined terms such as input file, analysis type, study designer, and simulation model parameters. In addition to simple terms, relationships and term hierarchies can be encoded with 5S. For example, the Dublin Core terms audience, contributor, creator, publisher, and rightsHolder are formally described using 5S society and scenario definitions in SimDL. The metadata terms for SimDL may be indexed in a graph to support subgraph identification for building metadata overlay reasoning.

Graph Construction: Every document in the managed collections has a metadata record. Complete metadata records have a metadata value for each of the metadata schema terms and are used for search purposes. The metadata schemas for these projects may be defined in RDF or XML Schema formats. Individual records are stored using RDF or in database tables.

RDF graph-based services require a labeled, undirected 5S graph G = (V, E), a set L of labels, and a labeling function for edges, l : E → L. Individual documents and users are nodes v_i ∈ V. Metadata terms are represented by edges e_i ∈ E with a label l_{e_i}, connecting nodes with matching metadata values. Note that G is a sparse graph in SimDL, as few documents have identical metadata values for any given term. However, as k metadata terms form k potential edges between a pair of nodes, |E| ≤ k·n(n−1)/2.
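The edge construction and the |E| ≤ k·n(n−1)/2 bound can be checked on a toy collection. The metadata records below are invented examples, not SimDL data.

```python
from itertools import combinations

# Sketch of the 5S metadata graph: documents are nodes; an edge labeled with
# metadata term t joins two documents whose records share a value for t.
records = {
    "do1": {"model": "EpiSimdemics", "region": "Alabama"},
    "do2": {"model": "EpiSimdemics", "region": "Montgomery"},
    "do3": {"model": "EpiFast",      "region": "Alabama"},
}

edges = [(u, v, term)
         for u, v in combinations(records, 2)
         for term in records[u]
         if records[u][term] == records[v][term]]

k, n = 2, len(records)          # k metadata terms, n documents
assert len(edges) <= k * n * (n - 1) // 2   # |E| ≤ k·n(n−1)/2
print(edges)  # → [('do1', 'do2', 'model'), ('do1', 'do3', 'region')]
```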

6.14.5 Summary

Generating interoperable metadata terms and description sets between simulation models and institutions is a non-trivial process. Similarly, cross-community collaborations on building domain vocabularies and ontologies are not simple or straightforward endeavors. Determining semantic relationships between models is technically difficult and has historically been unsuccessful. Often, modeling groups within a domain compete for funding and disagree on modeling paradigms. As an example, there are disagreements over whether high-level differential equation models or disaggregated agent-based models are more appropriate for given domains. There are also disagreements on the semantic meaning of commonly used terms in a domain vocabulary. Further, the technical use of term values varies widely between simulation model implementations. The large and diverse nature of sub-fields within a domain, such as computational epidemiology, generally means scientists cannot easily create meta-ontologies for a domain. This leads to fractured ontologies covering sub-fields of a domain with only partial overlaps. Further analyses of modeler-produced description sets and ontologies are required before we attempt to build a single, cohesive vocabulary, ontology, or term namespace for the simulation models and domains in epidemiology.

We have employed several tools to mitigate these challenges in the short term. Model vocabularies are constructed within a domain as research groups produce description sets. Taxonomies and ontologies are formed in each description set through 5S definitions of term structures and relationships. Simulation efforts frequently involve multi-institutional, multi-national collaborations. Model developers may provide multiple XML schemas of input, result, and analysis terms in different languages when producing description sets. These multi-lingual schemas have been shown to provide model-specific multi-lingual thesauri for model terms in our previous internationalization efforts [117]. Importantly, semantic relationships between model terms are typically easier to define between multiple models developed by the same simulation institution, as its scientists are intimately familiar with each model. Thus, a two-tiered harmonization approach may be most appropriate: define semantic matches within an institution to consolidate terms before the more difficult process of inter-institutional harmonization.

A major limitation of the current description set generation framework is its reliance on domain modelers to choose good terms from the suggested terms. Description sets may require modification at a later time if a description set lacks terms needed for future digital library services. Future extensions to the framework may involve user profiles and required services per user role to refine the set of suggested terms. Similarly, the inclusion of unnecessary terms recommended by the system and approved by the user will make poor use of metadata storage space. There is no guarantee that a modeler will adhere to good term criteria, defined in [20], when generating description sets without previous training or expertise in term selection.

The framework consists of a semi-automatic production of metadata description sets and a process for automated term value extraction and record generation in simulation systems. The framework was used to define the metadata implementation used in SimDL. It also has potential for additional implementations in scientific data management efforts by other groups in non-epidemiology domains. We have developed a metadata description set generator and a harmonization implementation of the framework. This software has enabled digital library services for the computational epidemiology modeling and simulation community, which is currently hindered by a lack of ontologies and semantically linked metadata schemas. The contributions of this portion of the work are a framework and software implementation to efficiently produce description sets (containing Dublin Core, model, infrastructure, and 5S terms), harmonize terms between models, and automatically generate records for content. Future efforts are needed to incorporate the 5S Framework Term Wizard for adding 5S definitions into RDF description sets by modelers without expertise in formal definitions. Also needed are studies to evaluate the harmonization software after the generation framework has been used to produce a larger corpus of diverse description sets.


Chapter 7

Evaluation

Crucial to the adoption of the SimDL framework are positive results from case studies and an evaluation of the performance, coverage, and scalability of SimDL and its services. However, digital library evaluation is recognizably an extremely complex and difficult task [46, 133]. [30] suggests an approach based on evaluation elements and criteria. Evaluation elements are described for multiple entities such as network components, technical infrastructure, information content, information services, and use-case support. Evaluation criteria define the measurement for each element, i.e., extensiveness, efficiency, effectiveness, service quality, impact, usefulness, and adoption. The efficiency measurement may be broken down into details of a service implementation's metadata functions, database functions, input and output processing, and logical processing. Evaluations of metadata functions utilizing RDF triplestores generally consist of measurements with the Lehigh University Benchmark (LUBM), real data from UniProt, or domain-specific data. In previous quality-of-service metrics for application cluster efforts [116], database function measurements consisted of the time taken for a request to connect to the DB, the time taken for the request to be served by the DB, the time taken for the DB to initialize, to shut down, and to complete a backup, per-process CPU and memory utilization, the number and duration of users connected to the DB, memory allocated at startup, DB checkpoints, and memory per transaction. The remaining efficiency measurements are considered in terms of the average space and computation requirements needed to execute the service. As there are no existing parallel service implementations, the efficiency of SimDL is assessed by acceptable thresholds for service performance. Performance results for services must be reasonable, as inefficiency would prevent widespread adoption in case study infrastructures. Thus, the end-to-end processing for a service exposed to end-users should take no longer than a few seconds. A formal evaluation may be utilized to show the descriptive power of the formalisms and the completeness of the description of the SimDL framework. Formalisms may also prove that a set of services fulfills a particular use-case scenario and the full set of case study scenario requirements. It is expected that SimDL may save time and effort for end-users in tasks such as designing a study, conducting studies, identifying similar work, managing and repurposing previous work, and conducting standard analyses. However, end-user evaluation studies were not conducted due to the focus on evaluating system performance.


7.1 Methodology and Evaluation

As published in [92], the following three components may be used to guide the evaluation of the suitability of SimDL for scientific settings.

7.1.1 Formal Descriptions

Theoretical foundations for users, content, and services provide a means of explicitly defining digital libraries. With these definitions, services can be implemented and reused between DL efforts. Here, the 5S framework notation is used to define SimDL while making use of previous definitions. Content including schemas, model ontologies, input configurations, sub-configurations, results, analyses, and experiments has been formally defined for SimDL [87]. User roles were also defined for SimDL, including tool builders, administrators, systems, study designers, analysts, annotators, and explorers. Formal definitions of simulation-specific services were proposed in parallel work [87]. Thus, full formal descriptions of users, content, and services have been produced.

7.1.2 Implementations

Formal definitions of services, in 5S notation, include an informal description, required inputs, produced outputs, service pre-conditions, and post-conditions. The process for developing a service includes task identification, definition, and implementation. After service definition, developers previously had leeway to implement the service however necessary, as long as the input format, output format, and conditions were met. With the inclusion of implementation details in 5S notation, the defining and implementing stages have merged, allowing practical constraints to inform formal definitions of the approach taken. Further work is required to determine if the current definition notation is sufficient for guiding the complete implementation of services.
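The shape of such a definition can be sketched in code. This is a minimal illustration of the idea of a service contract with pre- and post-conditions, not the dissertation's actual 5S notation; the class name, fields, and the `DLRegister` example values are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ServiceDefinition:
    """Sketch of a 5S-style service definition: informal description,
    declared inputs/outputs, and explicit pre-/post-conditions."""
    name: str
    description: str
    inputs: list
    outputs: list
    precondition: Callable[[Any], bool]
    postcondition: Callable[[Any, Any], bool]

    def run(self, impl, request):
        # Any implementation meeting the contract may be plugged in.
        if not self.precondition(request):
            raise ValueError(f"{self.name}: precondition failed")
        response = impl(request)
        if not self.postcondition(request, response):
            raise ValueError(f"{self.name}: postcondition failed")
        return response

register = ServiceDefinition(
    name="DLRegister",
    description="Register a digital object and index its metadata",
    inputs=["digital_object"],
    outputs=["dl_identifier"],
    precondition=lambda req: "metadata" in req,
    postcondition=lambda req, res: isinstance(res, str) and res != "",
)

# A toy implementation satisfying the declared contract.
new_id = register.run(lambda req: "DO-0001", {"metadata": {"title": "run 42"}})
```

Separating the definition from the implementation in this way is what lets practical constraints feed back into the formal definition, as described above.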

7.1.3 Interoperability

SimDL supports user, architecture, and functionality interoperability through simulation-specific services. SimDL provides an API to allow user interfaces to interact with content collections. Multiple generic interfaces were built using this API in epidemiological and network science case studies. Thus, users have consistent interface presentations between models and systems. Architectural interoperability is provided through the brokering system's communication with external components and infrastructures. With architectural interoperability, SimDL may be integrated into multiple simulation infrastructures that can provide adaptors to the brokering system. Functional interoperability is attempted through formal definitions of the inputs, outputs, pre-conditions, and post-conditions of services. With formal definitions of service inputs and outputs, it is possible to


reason over compositions of interoperable services, interoperability validation, and interoperability service registries.
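The adaptor idea behind architectural interoperability can be sketched as a small posting/retrieving interface. The method names and the in-memory blackboard below are assumptions for illustration, not SimDL's actual broker API.

```python
from abc import ABC, abstractmethod

class BrokerAdaptor(ABC):
    """Sketch: an infrastructure integrates with the brokering system by
    implementing a minimal blackboard-style interface."""

    @abstractmethod
    def post(self, request: dict) -> str:
        """Place a service request on the blackboard; return a ticket."""

    @abstractmethod
    def retrieve(self, ticket: str) -> dict:
        """Fetch the response for a previously posted request."""

class InMemoryAdaptor(BrokerAdaptor):
    """Toy adaptor standing in for a real simulation infrastructure."""

    def __init__(self):
        self._board = {}
        self._next = 0

    def post(self, request):
        ticket = f"t{self._next}"
        self._next += 1
        # A real broker would select and invoke a DL service here.
        self._board[ticket] = {"service": request["service"], "status": "done"}
        return ticket

    def retrieve(self, ticket):
        return self._board[ticket]

adaptor = InMemoryAdaptor()
ticket = adaptor.post({"service": "DLSearch", "query": "influenza"})
response = adaptor.retrieve(ticket)
```

Because the interface is narrow, any infrastructure that can post and retrieve requests in this shape can be brokered to, which is the essence of the architectural interoperability claim.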

7.2 Framework Evaluation

The SimDL framework is evaluated in its multiple forms as a digital library management system, DL system, and instance. As a DLMS, SimDL is evaluated based on the completeness and effectiveness of its defined components. Service completeness is demonstrated by comparing the coverage provided in the DLS toolkit to the anticipated functionality requirements. The desired functionality was gathered from use cases of the predefined user roles [87]. The effectiveness of the DLMS design is determined by whether its definitions led to a DLS that provides the required functionality. As a DLS, SimDL is evaluated in terms of implementation performance and deployment effectiveness. To provide effective DL instances, the DLS must contain effective and efficient services. The Boolean effectiveness rating of a service is assigned based on whether the DLS provisions an appropriate service in the DL prototype that is able to fulfill a given set of functionality requirements. The minimally acceptable DLMS effectiveness is achieved if the full set of DLMS functionality requirements is covered in a fully expanded DL instance. Efficiency was determined by measuring the response time, computational cost, internal space requirements, and parallel scalability limits of each service within the DL prototype instances. Deployment effectiveness considers the usefulness of the DLMS and its DLS implementation for developers within simulation groups. Specifically, the lines of DLS software code used in each prototype are counted. As software reuse differed in each prototype instance, the lines of code (LOC) measurement was normalized by the number and complexity of the DLMS functionality required by each prototype.
In addition, the lines of 'wrapper' code needed to make use of the DLS within each infrastructure were counted. This evaluation determined whether SimDL, as a DLMS and DLS, is reasonably effective and generalizable across deployment instances.
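The normalization described above can be sketched as a small calculation. The function name and the input values here are illustrative assumptions (the 2,000 LOC figure echoes the minimal toolkit size quoted later in the chapter; the wrapper figure is hypothetical), not measured results.

```python
# Sketch of the LOC normalization: reused DLS code per unit of required
# functionality, so prototypes needing different numbers of services can
# be compared on equal footing.
def normalized_reuse(reused_loc, wrapper_loc, num_required_services):
    """Return (reused LOC per required service, share of code that is reused)."""
    per_service = reused_loc / num_required_services
    reuse_share = reused_loc / (reused_loc + wrapper_loc)
    return per_service, reuse_share

per_service, share = normalized_reuse(
    reused_loc=2000,          # minimal toolkit size (from this chapter)
    wrapper_loc=400,          # hypothetical per-infrastructure wrapper code
    num_required_services=10, # hypothetical count of required services
)
```

A high reuse share across prototypes would support the generalizability claim; a low one would indicate that each deployment is effectively a reimplementation.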

7.2.1 DLMS Completeness

DLMS completeness is evaluated by comparing the DLMS definitions to the functionality required in DL instances from use case modeling. Simply put, does the set of DLMS descriptions fully specify a class of DLs that would lead to DLSs and DLs desirable for a community? See Table 7.1 for further details on this evaluation based on the expected user roles presented in [87]. Refer to the earlier descriptions of formalisms for the full specification of each DLMS definition. The minimal set of SimDL DLMS definitions was guided by, and fully specifies, the anticipated user requirements for the modeled user roles. While additional services may be provided to support extended functionality, the described DLMS covers the complete set of minimal functional requirements, as shown in Table 7.1.


Table 7.1: DLMS definition completeness.

Use cases          | Relevant DLMS definitions
Building tools     | Communicating, getting DL identifiers, generating schemas
Administration     | Curating, ranking
Supporting systems | Communicating, memoizing
Designing studies  | Registering content, registering provenance
Analyzing          | Supporting metadata
Annotating         | Registering metadata
Exploring          | Searching, filtering, discovering, getting metadata, resolving provenance

7.2.2 DLS Software Toolkit Effectiveness

The toolkit must provide support for the set of functionality and content defined in the DLMS. While case study DL instances often require a subset of the full set of available functionality, the toolkit must implement the functionality required by the set of instances it is tasked with supporting. Some extended formal definitions in the DLMS have neither a software DLS implementation nor a case study instance that requires the functionality described in the definition. These definitions are considered ongoing extensions or provisional efforts and are not considered in the DLS effectiveness. The suitability of DLS service implementations is measured in the DLS toolkit efficiency evaluation. Services provided in a toolkit are assumed to be of sufficient quality to be considered an effective implementation of the DLMS requirements. Table 7.2 details the effectiveness of the SimDL software toolkit relative to the formal definitions as described in Fig. 5.5 and Fig. 5.6. Many tasks are high-level abstractions of a single DL service, e.g., contributing, indexing, storing, and provenance registering are separate tasks supported by the DLRegister service. Note that the DLGetMetadata service returns data manager identifiers, allowing components to access and retrieve raw digital objects directly. In every case, the provided software toolkit provides services to implement each identified minimal functionality.

7.2.3 DLS Software Toolkit Efficiency (Service Performance)

The efficiency of the DLBroker is given in Table 7.3. Each measurement is the mean of 1,000 performance tests. The blackboard connection time is only incurred once at DL startup. The remaining costs are incurred with each DL service request.

The efficiency of each DL service is provided in Table 7.4 as a reference point. The results are given on small (i.e., 100-item collection), medium (i.e., 1,000-item collection), and large (i.e., 100,000-item collection) metadata indexes to show the scalability of the current implementations. Each value is the average time spent in a portion of code over 1,000 executions of the service. In alignment with Dublin Core, RDF triplestore metadata indexes were generated to contain fifteen


Table 7.2: DLMS definitions and DLS toolkit comparison.

Formally and informally described tasks | DLS toolkit service
Browsing                     | DLSearch, DLGetMetadata
Community growing            | DLIncentivize
Comparing                    | DLMemoize, DLGetMetadata
Consuming and retrieving     | DLGetMetadata
Curating                     | DLCurate, DLDORanker
Discovering                  | DLSearch, DLFilter
Disseminating                | DLSearch, DLFilter
Incentivizing                | DLIncentivize
Indexing                     | DLRegister
Logging                      | DLRegister, DLProvRegister
Metadata extracting          | External component-specific codes
Metadata record generating   | DLRegister
Metadata schema generating   | Off-line software and UI
Reusing                      | DLMemoization, DLGetMetadata
Producing and contributing   | DLRegister
Provenance tracking          | DLProvRegister, DLProvResolve
Searching                    | DLSearch
Storing                      | DLRegister, DLProvRegister
Usage tracking               | DLIncentivize, DLGetMetadata
Workflowing                  | DLProvRegister, DLProvResolve

Table 7.3: Evaluation of the DLBroker in milliseconds.

Blackboard connection time | Processing and service selection overhead | Blackboard posting time | Blackboard retrieving time
2,500 | 0.1 | 33.2 | 68.5


elements. For the discovery services, all queries were composed of two terms, in alignment with the current average query length of popular search engines, e.g., Google. Here, n represents the number of DOs in the collection and the metadata index, r denotes the number of rules or reports (composed of multiple clauses), and c gives the number of clauses (DB queries) in a rule. I have optimized these algorithms to scale with regard to n. However, in environments with smaller, static collections and complex reports or queries, it may be possible for alternative algorithms to be optimized in terms of r and c.
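The measurement protocol above (mean over repeated executions, across collection sizes) can be sketched as a small harness. The service here is a stand-in linear scan, not SimDL's DLSearch, and the run count is reduced from 1,000 for brevity.

```python
import statistics
import time

def mean_runtime_ms(service_fn, collection, runs=1000):
    """Mean wall-clock time in milliseconds over repeated executions,
    mirroring the protocol used for Tables 7.3 and 7.4."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        service_fn(collection)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)

def linear_scan_search(collection):
    # Stand-in for a discovery service: scan every metadata record
    # for a two-term query, giving the O(n) behavior noted in the text.
    return [m for m in collection if "flu" in m and "chicago" in m]

small = ["flu chicago run %d" % i for i in range(100)]
large = ["flu chicago run %d" % i for i in range(100_000)]

t_small = mean_runtime_ms(linear_scan_search, small, runs=20)
t_large = mean_runtime_ms(linear_scan_search, large, runs=20)
```

Running the same service against the small and large indexes makes the scaling behavior directly visible, which is how the small/medium/large columns of Table 7.4 were obtained.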

7.2.4 Service Generality

We expect a reduction in software development efforts when integrating and utilizing the SimDL toolkit in comparison to customized, institute-specific software. This effort can be measured in reusable lines of code (LOC), LOC needed to parameterize a service for a specific infrastructure, development time, and ease of use. The minimal software implementation is roughly 2,000 lines of Java code, excluding DL properties initialization files, instance-specific parameterization of reports, statistics and metrics, DB connections, curation policies, and DB setup. User studies are not a part of this evaluation. See Table 7.5 for the LOC effort and reductions related to integrating the SimDL toolkit instead of reimplementing it. These estimates are for SimDL development and integration LOC only. Roughly ten high-level statements are needed to invoke each service from an external component. The parameterization is an estimate based on the minimal number of reports and metrics for standard DO files described in the preceding chapter; see Fig. 6.13. Depending on the exhaustiveness of the desired incentivizing and curating functions, the actual requirements will be several times this amount.


Table 7.4: Evaluation of the DL services, with time resolution in milliseconds (with small, medium, and large indexes).

Service              | Overall cost | Input processing | Database or metadata cost | Processing cost | Internal memory bound O{f(x)} | Complexity O{f(x)} (index size) bound
DLCurate (s)         | 220.6 | 0    | 28.2  | 192.3 | rcn | rcn
DLCurate (m)         | 257   | 0    | 30.3  | 226.3 |     |
DLCurate (l)         | 460.5 | 0    | 236   | 224.5 |     |
DLCurateDORanker (s) | 223.6 | .1   | 26.4  | 197   | n   | cn
DLCurateDORanker (m) | 236   | 0    | 29.3  | 206.4 |     |
DLCurateDORanker (l) | 484.7 | .1   | 241.4 | 243.2 |     |
DLFilter (s)         | 229.1 | .2   | 7.8   | .1    | n   | n+c
DLFilter (m)         | 206.6 | 0    | 11.9  | 0     |     |
DLFilter (l)         | 331.8 | .2   | 133.3 | .1    |     |
DLGetDLID            | 224.6 | 0    | 13.2  | 8.7   | 1   | 1
DLGetMetadata (s)    | 210.3 | 0    | 7.1   | .8    | 1   | n
DLGetMetadata (m)    | 209.7 | 0    | 6.9   | 1.1   |     |
DLGetMetadata (l)    | 254   | 0    | 7.2   | 38.3  |     |
DLIncentivize (s)    | 135   | 73   | 55    | 0     | n   | rn
DLIncentivize (m)    | 135.9 | 73.2 | 55.5  | .2    |     |
DLIncentivize (l)    | 474   | 75.5 | 389.4 | .3    |     |
DLMemoization (s)    | 226.9 | 83.5 | 10.9  | 123.3 | n   | n+c
DLMemoization (m)    | 260.4 | 64.5 | 11.4  | 178.1 |     |
DLMemoization (l)    | 450.4 | 65.9 | 115   | 251.1 |     |
DLProvRegister       | 226.2 | 70   | 1.5   | 50    | 1   | 1
DLProvResolve (s)    | 208.9 | 0    | 7.4   | 0     | n   | n
DLProvResolve (m)    | 204.3 | 0    | 7.3   | 0     |     |
DLProvResolve (l)    | 246.5 | 0    | 45.4  | 0.1   |     |
DLRegister           | 263.5 | 0    | 14    | 9     | 1   | 1
DLSearch (s)         | 104.4 | 0.1  | 21.5  | 0     | n   | cn
DLSearch (m)         | 129.4 | 0    | 19.3  | 0     |     |
DLSearch (l)         | 374.1 | 2    | 5.8   | 116.8 |     |


Table 7.5: Service deployment efforts for the CINET case study.

Service          | LOC to parameterize | LOC reused for developers
DLCurate (s)     | 250 | 559
DLCurateDORanker | 20  | 410
DLFilter         | 0   | 156
DLGetDLID        | 0   | 117
DLGetMetadata    | 0   | 135
DLIncentivize    | 88  | 221
DLMemoization    | 0   | 87
DLProvRegister   | 0   | 124
DLProvResolve    | 0   | 119
DLRegister       | 0   | 154
DLSearch         | 0   | 170


Chapter 8

Conclusions and Future Work

8.1 Summary

Our efforts in the SimDL framework aimed to introduce a new generation of digital libraries that directly connect researcher efforts with simulation environments. The SimDL model allows researchers to take advantage of large existing bases of experimentation by archiving previous work for shared use among collaborators. SimDL may play a key role in the coordinated data management efforts of large cyberinfrastructure environments, particularly those based on computational epidemiology and network science. Future uses of SimDL may expand into general simulation environments.

In this work, we have

• covered the digital library requirements of the computational epidemiology and network science communities;

• generated the SimDL framework for developing and implementing a class of digital libraries;

• formalized SimDL as a digital library management system, in the 5S framework, and showed how to translate definitions into alternatives (i.e., the DELOS Reference Model);

• produced a digital library system software toolkit from the DLMS;

• deployed multiple digital library instances from the DLS;

• proposed and studied important considerations and an evaluation for digital library case study deployment;

• demonstrated that this process may be utilized to generate software toolkits with reasonable services; and

• shown that this process is capable of supporting simulation-related infrastructures efficiently.


8.2 Hypothesis Review

Looking back at what was accomplished, we have proven both of the hypotheses. From chapters 3 and 4, we deduce that it is possible to describe the current and anticipated needs of the simulation-based research community and formally define relevant content, users, tasks, digital library services, and digital library software. This was demonstrated through utilization of the 5S framework. We have shown that the DLMS and implemented DLS toolkit led to efficient (semi-)automated services in support of the defined requirements. Additionally, we described what constitutes scientific digital library artifacts, previously unavailable SimDL services, and evaluations of SimDL performance.

8.3 Future Work

Building on this effort, we can generalize the SimDL framework to a broader science-supporting digital library with appropriately defined minimal functionality. Additional performance, case study, and user study evaluations may answer the following questions.

• How effectively can epidemiology simulation research groups not involved in developing the prototypes make use of SimDL?

• How do SimDL and its services improve efficiency in comparison to existing systems?

• What are the performance gains when comparing designing and running experiments with and without SimDL services?

• How well can educators carry out educational activities with managed content, digital libraries, and simulation infrastructures?

• Are there alternative, standard methodologies for running experiments in simulation groups (other than study designs, results, analyses, data mining, and publication)?

• How well does SimDL support this from end to end, based on the effort required to create a study and execute an experiment?

• What additional functions may be provided through SimDL from emerging workflow, digital library, and management systems?

• How suitable is SimDL outside of epidemiological and network science domains?

Formal definitions for content, services, and users have been completed along with many service implementations. Deployments of SimDL have been implemented for managing individual simulation models and cyberinfrastructure. Implementing functionality for extended (non-minimal) services will remain an ongoing development task, and integration requires tight coupling through assumed brokers. Services are currently being organized into a registry for discovery and reuse, albeit in a DL community without a history of registry utilization.


8.3.1 Functional Extensions

Functional extensions and research efforts to improve SimDL are in continual development. While the intention is to provide SimDL as an extensible platform for deploying digital libraries to manage scientific content, several components are required to promote adoption within the epidemiology, simulation, digital library, and larger scientific communities. These components may be added to the software toolkit using the framework to produce an extended version of the minimal SimDL, adding general implementations of further services specific to the epidemiology and network science domains. A next-generation set of components is needed to annotate and recommend digital objects; ask high-level semantic search queries; query across SimDL instances; federate SimDL instances with other digital libraries through semantic mapping; provide interoperability for accessing metadata and collections with other digital libraries; and provide a method for communication within the DL system (e.g., message boards). Visualization components may be added to the paired UI to ease input configuration navigation; dragging and dropping sub-configurations; fisheye overviews of a schema; and treemap views of dataset, analysis, document, and annotation collections. Maintaining user sessions across SimDL instances may provide more coherent use of user credentials, demographics, access rights, preferences, tailored views, expertise, classes of rights (access to content), and roles (access to system functionalities). However, SimDL is not initially envisioned as a platform for semantically combining all worldwide simulation systems. The foremost expectation for a SimDL instance is to holistically provide DL services for interested simulation groups within a domain under multiple contexts.

This framework may be extended with high-level language processing, translation between questions and study designs, automated model selection and parameterization, cloud-based searching implementations, and DL service computation offloading to HPC resources. Additional curation activities may be added by simply parameterizing the curation service. Obvious extensions include translating to new file and metadata formats as a part of migration, archiving older software codes, and emulating operating systems. Scientific content migration may be automated and added to the curation policy triggers and processes. In addition, content regeneration and study reproduction are possible through the existing provenance links and curation services. The provenance tracking service is able to link successive DOs in an arbitrary workflow. This concept may be further fleshed out with a formalization of tracking provenance in dynamic workflows. For the scientific community, a large number of science-supporting services could be added to SimDL. As examples, linear extrapolation of simulation results could be performed for some models based on existing data, and multivariate query-by-sketches could apply data mining to existing collections as demonstrated in [13]. Additional work is needed in defining domain-appropriate methods for comparing complex experiment sets in ranked searches. Note that the current search functions allow for the application of complex, weighted queries and may be utilized to conduct experiment comparisons. Novice experimenters lack detailed knowledge of simulation models, prior studies, and how to properly design studies. A service to automate model selection and parameterization based on high-level statements and queries would greatly benefit this class of users. Additional evaluations of similarity and distance measures would provide insights into alternative discovery functions,


e.g., Cosine, Dice, Euclidean, Forbes, Russell-Rao, Yule, Simpson, Manhattan, Tversky, and Kendall's tau, and similarity analysis. While the SimDL framework and software toolkit have been used successfully in several case studies, we wish to see further adoption and extensions by other computational epidemiology, simulation-research, and DL research groups.
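Two of the similarity measures named above can be sketched over binary metadata vectors (1 = the experiment has the attribute). These are the standard Cosine and Dice coefficients, not SimDL's own ranking code, and the example vectors are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dice(a, b):
    """Dice coefficient for binary attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2.0 * dot / total if total else 0.0

# Hypothetical attribute vectors for two experiment configurations.
exp1 = [1, 1, 0, 1]
exp2 = [1, 0, 0, 1]
c = cosine(exp1, exp2)
d = dice(exp1, exp2)
```

Evaluating several such measures over the same experiment collection is one way to compare alternative discovery functions, as the text proposes.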

8.3.2 Cloud, Grid, and Cluster Supporting DL Services

Distributed storage and retrieval implementations have yet to fully utilize available cloud, grid, and cluster infrastructures. This framework is especially suited to such research areas due to the direct availability of HPC platforms to SimDL instances. Future work in this framework should expand into, and advance, scalable and distributed DL computing. The storage, indexing, and discovery of content is a primary concern in digital libraries. As the number of simulation experiments to be stored increases, the storage size will necessarily increase. As additional computational power is added and simulation software is extended, it is likely that the inputs and results of simulations will grow as experiments become more complex. Over time, the input space of applications will be explored as directed by the usage of researchers. With novel functionality and attractive services, widespread adoption and heavy usage of SimDL at a particular research group would generate massive amounts of experimental data. Deterministic discrete-event simulation software employed by SimDL produces a predictably sized output depending on the input files. Should a non-terminating simulation software attempt to store results in SimDL, extra consideration would be required to ensure that the time steps required for the simulation to reach a steady state are acceptable; see [64]. When storing simulation-generated output in particular, the method of automated data collection is a prime concern. There is not a clear standard for the delivery method of simulation results. The interface to the simulation software (e.g., Simdemics) is a key element in storing the simulated data. It is necessary for the information provided to and produced by the simulation software to be matched directly with the digital library storage scheme. In addition, a given scenario may be replicated numerous times with different random seeds or inputs.
Researchers launching a set of replications may need to be made aware of the considerable storage requirements involved. This may lead to compressed data formats for new content and links to similar input fragments across experiments. The framework described in this work may be utilized in its own right and provides a stable platform for investigating these possibilities.
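Because a deterministic simulation's output size is predictable from its input, the storage cost of a replicated study can be estimated before launch. The sketch below is a back-of-envelope illustration; the function name, per-run size, replication count, and compression ratio are all hypothetical.

```python
# Back-of-envelope estimate of storage for one scenario replicated with
# different random seeds, with optional compression of the stored output.
def study_storage_gb(output_gb_per_run, replications, compression_ratio=1.0):
    """Estimated storage in GB for a replicated study."""
    return output_gb_per_run * replications / compression_ratio

raw = study_storage_gb(output_gb_per_run=2.5, replications=25)
compressed = study_storage_gb(2.5, 25, compression_ratio=5.0)
```

Surfacing such an estimate at study-design time is one concrete way a DL could warn researchers before a large replication set is launched.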


Bibliography

[1] William L. Anderson. Some Challenges and Issues in Managing, and Preserving Access to, Long Lived Collections of Digital Scientific and Technical Data. Data Science Journal, 3:191--201, 2004.

[2] Steve Androulakis, Ashley M. Buckle, Ian Atkinson, David Groenewegen, Nick Nicholas, Andrew Treloar, and Anthony Beitz. ARCHER - e-Research Tools for Research Data Management. International Journal of Digital Curation, 4(1):22--33, 2009.

[3] Massimiliano Assante, Leonardo Candela, Donatella Castelli, Luca Frosini, Lucio Lelii, Paolo Manghi, Andrea Manzi, Pasquale Pagano, and Manuele Simi. An extensible virtual digital libraries generator. In ECDL '08, pages 122--134, 2008.

[4] Karla Atkins, Keith Bisset, Stephen Eubank, Xizhou Feng, and Madhav Marathe. Simulated Pandemic Influenza Outbreaks in Chicago. Technical Report NDSSL-07-004, Virginia Tech, 2004.

[5] Jerome Avondo. DMBI: Data Management for Bio-Imaging. Final JISC Report, pages 1--44, 2011.

[6] Chris Barrett, Keith Bisset, Jonathan Leidig, Achla Marathe, and Madhav Marathe. Economic and social impact of influenza mitigation strategies by demographic class. Epidemics Journal, 3:19--31, 2011.

[7] Chris Barrett, Keith Bisset, Jonathan P. Leidig, Achla Marathe, and Madhav Marathe. Estimating the impact of public and private strategies for controlling an epidemic: A multi-agent approach. In Twenty-First Innovative Applications of Artificial Intelligence Conference, pages 34--39, 2009.

[8] Christopher Barrett, Keith Bisset, Jonathan Leidig, Achla Marathe, and Madhav V. Marathe. An Integrated Modeling Environment to Study the Co-evolution of Networks, Individual Behavior and Epidemics. AI Magazine, 31(1):75--87, 2010.

[9] Christopher L. Barrett, Keith R. Bisset, Stephen G. Eubank, Xizhou Feng, and Madhav Marathe. EpiSimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1--12, Piscataway, NJ, USA, 2008. IEEE Press.

[10] Neil Beagrie. National Digital Preservation Initiatives: An Overview of Developments in Australia, France, the Netherlands, and the United Kingdom and of Related International Activity. Library of Congress, Washington, DC, 2003.

[11] Neil Beagrie, Julia Chruszcz, and Brian Lavoie. Keeping Research Data Safe: A Cost Model and Guidance for UK Universities. JISC, London, 2008.

[12] Robert E. Belford, John W. Moore, and Harry E. Pence. Enhancing Learning with Online Resources, Social Networking, and Digital Libraries. American Chemical Society, Washington, DC, 2010.

[13] Jurgen Bernard, Tobias Ruppert, Maximilian Scherer, Jorn Kohlhammer, and Tobias Schreck. Content-Based Layouts for Exploratory Metadata Search in Scientific Research Data. In Proceedings of the Joint Conference on Digital Libraries, JCDL '12, 2012.

[14] Keith Bisset, Jiangzhuo Chen, Xizhou Feng, Anil Vullikanti, and Madhav Marathe. EpiFast: a fast algorithm for large scale realistic epidemic simulations on distributed memory systems. In ICS '09: Proceedings of the 23rd International Conference on Supercomputing, pages 430--439, New York, NY, USA, 2009. ACM.

[15] National Science Board. Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century. NSF, 2005.

[16] Christine L. Borgman. What are digital libraries? Competing visions. Information Processing and Management, 35(3):227--243, 1999.

[17] Christine L. Borgman, Jillian C. Wallis, and Noel Enyedy. Little science confronts the data deluge: habitat ecology, embedded sensor networks, and digital libraries. International Journal on Digital Libraries, 7(1):17--30, October 2007.

[18] Christine L. Borgman, Jillian C. Wallis, Matthew S. Mayernik, and Alberto Pepe. Drowning in data: digital library architecture to support scientific use of embedded sensor networks. In Proceedings of the 7th Annual Joint Conference on Digital Libraries, pages 269--277, 2007.

[19] Mark Brown and Wendy White. Institutional Data Management Blueprint Final Report. University of Southampton Report, Southampton, UK, 2011.

[20] Douglas Campbell. Identifying the Identifiers. In International Conference on Dublin Core and Metadata Applications, pages 74--84, 2007.

[21] Leonardo Candela, Fuat Akal, Henri Avancini, Donatella Castelli, Luigi Fusco, Veronica Guidetti, Christoph Langguth, Andrea Manzi, Pasquale Pagano, Heiko Schuldt, Manuele Simi, Michael Springmann, and Laura Voicu. DILIGENT: integrating digital library and Grid technologies for a new Earth observation research infrastructure. International Journal on Digital Libraries, 7(1):59--80, October 2007.

[22] Leonardo Candela, Donatella Castelli, Christoph Langguth, Pasquale Pagano, Heiko Schuldt, Manuele Simi, and Laura Voicu. On-Demand Service Deployment and Process Support in e-Science DLs: the DILIGENT Experience. Proceedings of the 10th Workshop of ECDL: Digital Library goes eScience (DLSci06), 2006.

[23] Leonardo Candela, Donatella Castelli, and Pasquale Pagano. D4Science: an e-infrastructure for supporting virtual research. In Proceedings of IRCDL 2009 - 5th Italian Research Conference on Digital Libraries, pages 166--169, 2009.

[24] Donatella Castelli, Yannis Ioannidis, Seamus Ross, Hans-Jorg Schek, and Heiko Schuldt. Reference Model for DLMS. DELOS Network of Excellence on Digital Libraries, 2006.

[25] Karthik Channakeshava, Deepti Chafekar, Keith Bisset, Anil Vullikanti, and Madhav Marathe. EpiNet: A simulation framework to study the spread of malware in wireless networks. SIMUTools ’09, 2009.

[26] Bidyut B. Chaudhuri. Digital Document Processing: Major Directions and Recent Advances. Springer, New York, 2007.

[27] Melissa H. Cragin, Carole L. Palmer, Jacob R. Carlson, and Michael Witt. Data sharing, small science and institutional repositories. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 368(1926):4023--4038, 2010.

[28] Donald E. Curtis, Christopher S. Hlady, Sriram V. Pemmaraju, Philip M. Polgreen, and Alberto M. Segre. Modeling and estimating the spatial distribution of healthcare workers. In Proceedings of the 1st ACM International Health Informatics Symposium, IHI ’10, pages 287--296, New York, NY, USA, 2010. ACM.

[29] Mark Dahl, Kyle Banerjee, and Michael Spalti. Digital Libraries: Integrating Content and Systems. Chandos, Oxford, 2006.

[30] Pete Dalton, Stella Thebridge, and Rebecca Hartland-Fox. Evaluating Electronic Information Services. In Judith Andrews and Derek Law, editors, Digital Libraries: Policy, Planning, and Practice, pages 113--127. Ashgate, Burlington, 2004.

[31] Thomas H. Davenport and Laurence Prusak. Information Ecology. Oxford Univ Press, New York, 1997.

[32] James R. Davis and Carl Lagoze. NCSTRL: Design and Deployment of a Globally Distributed Digital Library. Journal of the American Society for Information Science, 1999.

Page 123: Epidemiology Experimentation and Simulation Management through Scientific Digital Libraries, Jonathan P. Leidig

Jonathan P. Leidig Chapter 8. Conclusions and Future Work 111

[33] Ewa Deelman, James Blythe, Yolanda Gil, Carl Kesselman, Scott Koranda, Albert Lazzarini, Gaurang Mehta, Maria Alessandra Papa, and Karan Vahi. Pegasus and the Pulsar Search: From Metadata to Execution on the Grid. In Applications Grid Workshop at the Fifth International Conference on Parallel Processing and Applied Mathematics (PPAM), pages 821--830, Czestochowa, Poland, 2003.

[34] Leslie M. Delserone. At the watershed: Preparing for research data management and stewardship at the University of Minnesota Libraries. Library Trends, 57(2):202--210, 2008.

[35] Chris Dodd and Judith Andrews. The Development of UCEEL: A Digital Library for the University of Central England. In Judith Andrews and Derek Law, editors, Digital Libraries: Policy, Planning, and Practice, pages 155--166. Ashgate, Burlington, 2004.

[36] Martin Donnelly, Sarah Jones, and John Pattenden-Fail. DMP Online: A demonstration of the Digital Curation Centre’s web-based tool for creating, maintaining and exporting data management plans. In Mounia Lalmas, Joemon Jose, Andreas Rauber, Fabrizio Sebastiani, and Ingo Frommholz, editors, Research and Advanced Technology for Digital Libraries, volume 6273 of Lecture Notes in Computer Science, pages 530--533. Springer Berlin / Heidelberg, 2010.

[37] Martin Donnelly and Robin North. The Milieu and the MESSAGE: Talking to Researchers about Data Curation Issues in a Large and Diverse E-science Project. International Journal of Digital Curation, 6(1), 2011.

[38] Rocky Dunlap, Leo Mark, Spencer Rugaber, Venkatramani Balaji, Julien Chastang, Luca Cinquini, Cecelia DeLuca, Don Middleton, and Sylvia Murphy. Earth system curator: metadata infrastructure for climate modeling. Earth Science Informatics, 1:131--149, 2008.

[39] Thierry Oscar Edoh and Gunnar Teege. EPharmacyNet: an approach to improve the pharmaceutical care delivery in developing countries - study case - BENIN. In Proceedings of the 1st ACM International Health Informatics Symposium, IHI ’10, pages 859--863, New York, NY, USA, 2010. ACM.

[40] Tracey Erwin, Julie Sweetkind-Singer, and Mary Lynette Larsgaard. The National Geospatial Digital Archives’ Collection Development: Lessons Learned. Library Trends, 57(3):490--515, 2009.

[41] John L. Faundeen. The Challenge of Archiving and Preserving Remotely Sensed Data. Data Science Journal, 2:159--163, 2003.

[42] Kathleen Fear. ‘You Made It, You Take Care of It’: Data Management as Personal Information Management. International Journal of Digital Curation, 6(2), 2011.

[43] Edward A. Fox. The 5S Framework for Digital Libraries and Two Case Studies: NDLTD and CSTC. In Proceedings NIT’99, Taipei, Taiwan, Aug. 18-20, 1999.


[44] Edward A. Fox, Marcos Andre Gonçalves, and Rao Shen, editors. Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach. Morgan and Claypool Publishers, 2012.

[45] Edward A. Fox, Seungwon Yang, and Seonho Kim. ETDs, NDLTD, and Open Access: A 5S Perspective. Ciencia da Informacao, 35(2), 2006.

[46] Norbert Fuhr, Preben Hansen, Michael Mabe, Andras Micsik, and Ingeborg Solvberg. Digital Libraries: A Generic Classification and Evaluation Scheme. In ECDL ’01, 2001.

[47] Tracy Gabridge. The Last Mile: Liaison Roles in Curating Science and Engineering Research Data. Research Library Issues: A Bimonthly Report from ARL, CNI, and SPARC, 265:15--21, 2009.

[48] Sujatha Das Gollapalli, Prasenjit Mitra, and C. Lee Giles. Similar researcher search in academic environments. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, JCDL ’12, pages 167--170, New York, NY, USA, 2012. ACM.

[49] Marcos André Gonçalves and Edward A. Fox. 5SLgen: A Tool for Semi-Automatic, Declarative Generation of Tailored Digital Libraries, 2002.

[50] Marcos André Gonçalves. Streams, structures, spaces, scenarios, societies (5S): A formal digital library framework and its applications. PhD thesis, Virginia Tech, Blacksburg, VA, USA, 2004.

[51] Marcos André Gonçalves and Edward A. Fox. 5SL: a language for declarative specification and generation of digital libraries. In Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’02, pages 263--272, New York, NY, USA, 2002. ACM.

[52] Marcos André Gonçalves, Edward A. Fox, and Layne T. Watson. Towards a digital library theory: a formal digital library ontology. Int J Digit Libr, 8:91--114, April 2008.

[53] Marcos André Gonçalves, Edward A. Fox, Layne T. Watson, and Neill A. Kipp. Streams, structures, spaces, scenarios, societies (5S): A formal model for digital libraries. ACM Trans Inf Syst, 22(2):270--312, 2004.

[54] Douglas Gorton. Practical Digital Library Generation into DSpace with the 5S Framework. Master’s thesis, Virginia Tech, Blacksburg, VA, USA, 2007.

[55] Jim Gray, Alexander S. Szalay, Ani Thakar, Christopher Stoughton, and Jan VandenBerg. Online scientific data curation, publication, and archiving. In CoRR, 2002.

[56] Norman Gray. Managing Research Data - Gravitational Waves. JISC Final Report, pages 1--5, 2011.


[57] Jane Greenberg, Hollie C. White, Sarah Carrier, and Ryan Scherle. A Metadata Best Practice for a Scientific Data Repository. In Jung ran Park, editor, Metadata Best Practices and Guidelines: Current Implementation and Future Trends, pages 194--212. Routledge, London, 2012.

[58] Daniel Greenstein and Suzanne E. Thorin. The Digital Library: A Biography. Digital Library Federation, Washington, 2002.

[59] Greenstone. Greenstone Digital Library Software homepage, 2011. http://www.greenstone.org/ [last visited July 4, 2012].

[60] Alicia F. Guidry, Judd L. Walson, and Neil F. Abernethy. Linking information systems for HIV care and research in Kenya. In Proceedings of the 1st ACM International Health Informatics Symposium, IHI ’10, pages 531--535, New York, NY, USA, 2010. ACM.

[61] Zhiyong Guo, Liangying Ma, and Huiyun Sun. A Cooperative Service Model for Digital Library Alliances Based on Grid. In Machine Vision and Human-Machine Interface (MVHI), pages 483--486, April 2010.

[62] Myron P. Gutmann, Mark Abrahamson, Margaret O. Adams, Micah Altman, Caroline R. Arms, Kenneth Bollen, Michael Carlson, Jonathan Crabtree, Darrell Donakowski, Gary King, Jared Lyle, Marc Maynard, Amy Pienta, Richard Rockwell, Lois Timms-Ferrara, and Copeland H. Young. From preserving the past to preserving the future: The Data-PASS project and the challenges of preserving digital social science data, 2009.

[63] Beda Christoph Hammerschmidt, Martin Kempa, and Volker Linnemann. Autonomous Index Optimization in XML Databases. In 21st International Conference on Data Engineering Workshops, 2005, page 1217, 2005.

[64] Charles Harrell. Simulation Using ProModel. McGraw-Hill, 2011.

[65] Ross Harvey. Digital Curation: A how-to-do-it Manual, volume 170 of How to Do It Manuals for Librarians. Neal-Schuman, New York, 2010.

[66] David C. Hay. Data Model Patterns: A Metadata Map. Morgan Kaufmann. Elsevier, Amsterdam, 2006.

[67] P. Bryan Heidorn. Shedding Light on the Dark Data in the Long Tail of Science. Library Trends, 57(2):280--299, 2008.

[68] P. Bryan Heidorn. The Emerging Role of Libraries in Data Curation and E-science. Journal of Library Administration, 51(7):662--672, 2011.

[69] Peter Hernon and Ellen Altman. Assessing Service Quality: Satisfying the Expectations of Library Customers. American Library Association, Chicago, 2010.


[70] Tony Hey, Stewart Tansley, and Kristin Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Wash, 2009.

[71] Tony Hey and Anne Trefethen. The data deluge: An e-science perspective. In Grid Computing - Making the Global Infrastructure a Reality, pages 809--824. Wiley and Sons, 2003.

[72] Doug Howe, Maria Costanzo, Petra Fey, Takashi Gojobori, Linda Hannick, Winston Hide, David P. Hill, Renate Kania, Mary Schaeffer, Susan St Pierre, Simon Twigger, Owen White, and Seung Yon Rhee. Future of Biocuration. Nature, 455:47--50, 2008.

[73] Patricia Hswe and Ann Holt. Joining in the Enterprise of Response in the Wake of the NSF Data Management Planning Requirement. Research Library Issues, 274:11--17, 2011.

[74] Gregory S. Hunter. Preserving Digital Information: A How To-Do-It Manual, volume 93 of How to Do It Manuals for Librarians. Neal-Schuman, New York, London, 2000.

[75] Yannis Ioannidis. User Working Group: Towards User Interoperability. In European Conference on Research and Advanced Technology for Digital Libraries, First DL.org Workshop presentation, 2009.

[76] Mike Jackson and Gabriel Bodard. Supporting Productive Queries for Research (SPQR): the Semantic Web and (KCL) Ancient Datasets. Classical Association Annual Conference, 2011.

[77] Nicholas Joint. Data Preservation, the New Science and the Practitioner Librarian. Library Review, 56(6):451--455, 2007.

[78] Thomas H. Jordan. SCEC 2009 Annual Report. Southern California Earthquake Center, 2009.

[79] Diane Kaplan. The Stanley Milgram papers: A case study on appraisal of and access to confidential data files. American Archivist, 59(3):288--297, 1996.

[80] Helena Karasti and Karen S. Baker. Digital data practices and the long term ecological research program growing global. International Journal of Digital Curation, 3(2):42--58, 2008.

[81] Ulla Bogvad Kejser, Anders Bo Nielsen, and Alex Thirifays. Cost model for digital preservation: Cost of digital migration. International Journal of Digital Curation, 6(1), 2011.

[82] Stefanie Kethers, Xiaobin Shen, Andrew E. Treloar, and Ross G. Wilkinson. Discovering Australia’s research data. In Proceedings of the 10th annual Joint Conference on Digital Libraries, JCDL ’10, pages 345--348, New York, NY, USA, 2010. ACM.

[83] Yunhyong Kim. Methodology for Curation and Preservation Experiments. Technical Report 4.26, Digital Curation Centre, 2008.


[84] Marisa L. Whose Role Is It Anyway?: A Library Practitioner’s Appraisal of the Digital Data Deluge. Bulletin of the American Society for Information Science & Technology, 37(5):21--23, 2011.

[85] Kathryn Lage, Barbara Losoff, and Jack Maness. Receptivity to Library Involvement in Scientific Data Curation: A Case Study at the University of Colorado Boulder. Libraries and the Academy, 11(4):915--937, 2011.

[86] Carl Lagoze, William Arms, Stoney Gan, Diane Hillmann, Christopher Ingram, Dean Krafft, Richard Marisa, Jon Phipps, John Saylor, Carol Terrizzi, Walter Hoehn, David Millman, James Allan, Sergio Guzman-Lara, and Tom Kalt. Core services in the architecture of the National Science Digital Library (NSDL). In JCDL ’02: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, Houston, TX, pages 201--209, 2002.

[87] Jonathan Leidig, Edward Fox, Madhav Marathe, and Henning Mortveit. Epidemiology experiment and simulation management through schema-based digital libraries. In Proceedings of the 2nd DL.org Workshop at ECDL, pages 57--66, 2010.

[88] Jonathan Leidig, Edward A. Fox, Kevin Hall, Madhav Marathe, and Henning Mortveit. SimDL: A Model Ontology Driven Digital Library for Simulation Systems. In ACM/IEEE Joint Conference on Digital Libraries, JCDL ’11. ACM, 2011.

[89] Jonathan Leidig, Madhav Marathe, and Edward A. Fox. Scientific Data Management through Digital Libraries. In NIH MIDAS Meeting, 2010.

[90] Jonathan P. Leidig. Connecting public health practitioners, policy makers, organizations, and scientists through a simulation infrastructure. In mHealth Summit, 2011.

[91] Jonathan P. Leidig. Managing Data in Medical and Bio-Informatics. In Southwest Virginia Life Science Forum, 2011.

[92] Jonathan P. Leidig. Digital Library Support for Public Health Simulation Infrastructures. TCDL Bulletin, 8(1), 2012.

[93] Jonathan P. Leidig, Chris Barrett, Keith Bisset, Achla Marathe, and Madhav Marathe. Economic and Social Impacts of Mitigation Strategies. In NIH MIDAS Meeting, 2011.

[94] Jonathan P. Leidig, Chris L. Barrett, and Madhav Marathe. Guiding Health Care Policy Through Applied Public Health Modeling and Simulation. In International Conference on Health Information Technology Advancement, ICHITA, 2011.

[95] Jonathan P. Leidig, Edward A. Fox, and Madhav Marathe. Simulation Tools for Producing Metadata Description Sets Covering Simulation-Based Content Collections. International Conference on Modeling, Simulation, and Identification, 755(045), 2011.


[96] Na Li, Leilei Zhu, Prasenjit Mitra, Karl Mueller, Eric Poweleit, and C. Lee Giles. oreChem ChemXSeer: a semantic digital library for chemistry. In Proceedings of the 10th annual Joint Conference on Digital Libraries, JCDL ’10, pages 245--254, New York, NY, USA, 2010. ACM.

[97] Jiming Liu and Shang Xia. Effective epidemic control via strategic vaccine deployment: a systematic approach. In Proceedings of the 1st ACM International Health Informatics Symposium, IHI ’10, pages 91--99, New York, NY, USA, 2010. ACM.

[98] Raymond A. Lorie. Long term preservation of digital information. In Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’01, pages 346--352, New York, NY, USA, 2001. ACM.

[99] Bertram Ludascher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, Jing Tao, and Yang Zhao. Scientific Workflow Management and the Kepler System. In Concurr Comput Pract Exper, 2005.

[100] Clifford Lynch and Hector Garcia-Molina. NDLTD: Preparing the Next Generation of Scholars for the Information Age. The New Review of Information Networking (NRIN), 1995.

[101] Stuart Macdonald and Luis Martinez-Uribe. Collaboration to Data Curation: Harnessing Institutional Expertise. New Review of Academic Librarianship, 16(1):4--16, 2010.

[102] Luis Martinez-Uribe and Stuart Macdonald. User Engagement in Research Data Curation. Lecture Notes in Computer Science, 5714:309--314, 2009.

[103] Gail McMillan. Digital libraries support distributed education or put the library in digital library. 9th National Conference of the Association of College and Research Libraries, 1999.

[104] Huang Meng-xing, Xing Chun-xiao, and Yang Ji-jiang. A Cooperative Framework of Service Chain for Digital Library. In Computer Software and Applications Conference, COMPSAC ’09, volume 2, pages 359--364, July 2009.

[105] Steven J. Miller. Metadata for Digital Collections: A how-to-do-it Manual, volume 179 of How to Do It Manuals for Librarians. Neal-Schuman, New York, 2011.

[106] MIT. DSpace: Durable digital depository, 2003.

[107] Reagan W. Moore, Arcot Rajasekar, Michael Wan, Yannis Katsis, Dayou Zhou, Alin Deutsch, and Yannis Papakonstantinou. Constraint-based Knowledge Systems for Grids, Digital Libraries, and Persistent Archives: Yearly Report. In SDSC TR-2005-5, 2005.

[108] Luc Moreau, Beth Plale, Simon Miles, Carole Goble, Paolo Missier, Roger Barga, Yogesh Simmhan, Joe Futrelle, Robert McGrath, Jim Myers, Patrick Paulson, Shawn Bowers, Bertram Ludaescher, Natalia Kwasnikowska, Jan Van den Bussche, Tommy Ellkvist, Juliana Freire, and Paul Groth. The Open Provenance Model (v1.01). In Technical Report, 2008.


[109] Bárbara Moreira, Marcos Gonçalves, Alberto Laender, and Edward Fox. Evaluating Digital Libraries with 5SQual. Research and Advanced Technology for Digital Libraries, ECDL, pages 466--470, 2007.

[110] Uma Murthy, Edward A. Fox, Yinlin Chen, Eric Hallerman, Ricardo Da Silva Torres, Evandro J. Ramos, and Tiago R. C. Falco. Superimposed image description and retrieval for fish species identification. In Proceedings of the 13th European conference on Research and advanced technology for digital libraries, ECDL’09, pages 285--296, Berlin, Heidelberg, 2009. Springer-Verlag.

[111] Uma Murthy, Doug Gorton, Ricardo Torres, Marcos Goncalves, Edward A. Fox, and Lois Delcambre. Extending the 5S Digital Library (DL) Framework: From a Minimal DL towards a DL Reference Model. In First Digital Library Foundations Workshop, JCDL ’07, New York, NY, USA, 2007. ACM.

[112] Research Information Network. Patterns of information use and exchange: case studies of researchers in the life sciences. British Library, pages 1--56, 2009.

[113] Gregory M. Nielson, Hans Hagen, and Heinrich Muller. Scientific Visualization: Overviews, Methodologies, and Techniques. IEEE Computer Society Press, Los Alamitos, California, 1997.

[114] National Virtual Observatory. Towards the National Virtual Observatory. SDT Report, pages 1--73, 2002.

[115] Carole L. Palmer, Bryan P. Heidorn, Dan Wright, and Melissa H. Cragin. Graduate curriculum for biological information specialists: A key to integration of scale in biology. International Journal of Digital Curation, 2(2):31--40, February 2008.

[116] Yi Pan and Laurence Yang. Parallel and Distributed Scientific and Engineering Computing: Practice and Experience, volume 15 of Advances in Computation: Theory and Practice. Nova Science Publishers, New York, 2004.

[117] Sung Park, Jonathan Leidig, Lin Li, Edward Fox, Nathan Short, Kevin Hoyle, Lynn Abbott, and Michael Hsiao. Experiment and Analysis Services in a Fingerprint Digital Library for Collaborative Research. In Stefan Gradmann, Francesca Borri, Carlo Meghini, and Heiko Schuldt, editors, Research and Advanced Technology for Digital Libraries, volume 6966 of Lecture Notes in Computer Science, pages 179--191. Springer Berlin / Heidelberg, 2011.

[118] Manjula Patel and Alexander Ball. Challenges and issues relating to the use of representation information for the digital curation of crystallography and engineering data. International Journal of Digital Curation, 3(1):76--88, 2008.

[119] Sandra Payette and Carl Lagoze. Flexible and Extensible Digital Object and Repository Architecture (FEDORA). In ECDL ’98: Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, pages 41--59, London, UK, 1998. Springer-Verlag.

[120] Christie Peters and Anita Riley Dryden. Assessing the academic library’s role in Campus-Wide research data management: A first step at the University of Houston. Science & Technology Libraries, 30(4):387--403, 2011.

[121] Meik Poschen, June Finch, Rob Procter, Mhorag Goff, Mary McDerby, Simon Collins, Jon Besson, Lorraine Beard, and Tom Grahame. Development of a Pilot Data Management Infrastructure for Biomedical Researchers at University of Manchester - Approach, Findings, Challenges and Outlook of the MaDAM Project. Intl Digital Curation Conf, 2011.

[122] Hossein Pourreza, Sergio Camorlinga, Carson K.S. Leung, and Bruce D. Martin. An information and communication technology system to support rural healthcare delivery. In Proceedings of the 1st ACM International Health Informatics Symposium, IHI ’10, pages 440--444, New York, NY, USA, 2010. ACM.

[123] Graham Pryor. Attitudes and aspirations in a diverse world: The project StORe perspective on scientific repositories. Intl Journal of Digital Curation, 2(1):135--144, 2008.

[124] Raghu Ramakrishnan and Johannes Gehrke. Database management systems. Computer science series. McGraw-Hill, 2003.

[125] Shiyali R. Ranganathan. The Five Laws of Library Science. Madras Library Association and Edward Goldston, Madras, India and London, 1931.

[126] Art Rhyno. Using Open Source Systems for Digital Libraries. Libraries Unlimited, Westport, Conn., 2004.

[127] Robin Rice and Jeff Haywood. Research data management initiatives at the University of Edinburgh. International Journal of Digital Curation, 6(2):232--244, 2011.

[128] Thomas Risse and Predrag Knezevic. A Self-organizing Data Store for Large Scale Distributed Infrastructures. In 21st International Conference on Data Engineering Workshops, page 1211, 2005.

[129] Hans E. Roosendaal and Peter A.Th.M. Geurts. Forces and Functions in Scientific Communication: an analysis of their interplay. Co-operative Research in Information Systems in Physics, September 1997.

[130] David S. H. Rosenthal, Thomas Robertson, Tom Lipkis, Vicky Reich, and Seth Morabito. Requirements for Digital Preservation Systems. D-Lib Magazine, 11(11), 2005.

[131] Gerard Salton and Michael E. Lesk. Computer evaluation of indexing and text processing. J. ACM, 15(1):8--36, January 1968.


[132] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513--523, 1988.

[133] Tefko Saracevic. Digital Library Evaluation: Toward Evolution of Concepts. Library Trends, 49(2):350--369, 2000.

[134] Jennifer M. Schopf and Steven Newhouse. User priorities for data: Results from SUPER. International Journal of Digital Curation, 2(1):149--155, February 2008.

[135] Hardy Schwamm. FISHNet: Freshwater Information SHaring Network, a new approach to sharing aquatic data. Intl Assoc of Aquatic and Marine Science Libraries and Info Centers Conf, 2010.

[136] Rao Shen, Marcos Gonçalves, Weiguo Fan, and Edward A. Fox. Requirements Gathering and Modeling of Domain-Specific Digital Libraries with the 5S Framework: An Archaeological Case Study with ETANA. In Andreas Rauber, Stavros Christodoulakis, and A Tjoa, editors, Research and Advanced Technology for Digital Libraries, volume 3652 of Lecture Notes in Computer Science, pages 1--12. Springer Berlin / Heidelberg, 2005.

[137] Thomas Smith, Gerry F. Killeen, Nicolas Maire, Amanda Ross, Louis Molineaux, Fabrizio Tediosi, Guy Hutton, Jurg Utzinger, Klaus Dietz, and Marcel Tanner. Mathematical modeling of the impact of malaria vaccines on the clinical epidemiology and natural history of Plasmodium falciparum malaria: Overview. In Am J Trop Med Hyg, volume 75, pages 1--10, 2006.

[138] Catherine Soehner, Catherine Steeves, and Jennifer Ward. E-Science and data support services: a study of ARL member institutions. Association of Research Libraries, 2010.

[139] Robert Tansley and Stevan Harnad. Eprints.org software for creating institutional and individual open archives. D-Lib Magazine, 6(10), October 2000.

[140] Jonathan A. Tedds and David P. Carter. HALOGEN; History, Archaeology, Linguistics, Onomastics and GENetics: The Roots of the British interdisciplinary research database. DCC, 2010.

[141] Sarah Timm. The Generation and Management of Museum Centered Geologic Materials and Information. Master’s thesis, Virginia Tech, Blacksburg, VA, USA, 2012.

[142] Eleni Toli, George Athanasopoulos, Anna Nika, and Stephanie Parker. DL.org Coordination Action on Digital Library Interoperability, Best Practices and Modelling Foundations. DL.org First International Workshop Proceedings, 2009.

[143] Ricardo Torres, Claudia Medeiros, Marcos Gonçalves, and Edward Fox. A digital library framework for biodiversity information systems. International Journal on Digital Libraries, 6(1):3--17, 2006.


[144] Andrew Treloar. Design and implementation of the Australian National Data Service. International Journal of Digital Curation, 4(1):125--137, June 2009.

[145] Andrew Treloar, David Groenewegen, and Cathrine Harboe-Ree. The Data Curation Continuum: Managing Data Objects in Institutional Repositories. D-Lib Magazine, 13(9), 2007.

[146] Andrew Treloar and Ross Wilkinson. Access to data for eResearch: designing the Australian National Data Service discovery services. International Journal of Digital Curation, 3(2):151--158, February 2008.

[147] UC3. UC3 Curation Foundations. Technical report, UC Curation Center, California Digital Library, March 2010.

[148] Kai-Uwe Sattler, Eike Schallehn, and Ingolf Geist. Towards Indexing Schemes for Self-Tuning DBMS. In 21st International Conference on Data Engineering Workshops, 2005, page 1216, 2005.

[149] Jan Velterop. Keeping the Minutes of Science. Electronic Libraries and Visual Information Research (ELVIRA 2), ASLIB, 1995.

[150] Jillian C. Wallis, Christine L. Borgman, Matthew S. Mayernik, and Alberto Pepe. Moving Archival Practices Upstream: An Exploration of the Life Cycle of Ecological Sensing Data in Collaborative Field Research. International Journal of Digital Curation, 3(1), 2008.

[151] Catharine Ward, Lesley Freiman, Sarah Jones, Laura Molloy, and Kellie Snow. Making sense: Talking data management with researchers. International Journal of Digital Curation, 6(2):265--273, July 2011.

[152] Simeon Warner, Jeroen Bekaert, Carl Lagoze, Xiaoming Liu, Sandy Payette, and Herbert Van de Sompel. Pathways: Augmenting interoperability across scholarly repositories. Intl Journal on Digital Libraries special issue on DLs and eScience, page 18, 2006.

[153] Stuart Weibel. The Dublin Core: A Simple Content Description Model for Electronic Resources. Bulletin of the American Society for Information Science and Technology, 24(1):9--11, November 1997.

[154] Katherine Wendelsdorf, Josep Bassaganya-Riera, Raquel Hontecillas, and Stephen Eubank. Model of colonic inflammation: Immune modulatory mechanisms in inflammatory bowel disease. Journal of Theoretical Biology, 264(4):1225--1239, 2010.

[155] D. Randall Wilson and Tony R. Martinez. Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6:1--34, 1997.

[156] Michael Witt. Institutional Repositories and Research Data Curation in a Distributed Environment. Library Trends, 57(2):191--201, 2008.


[157] Ian H. Witten and David Bainbridge. How to Build a Digital Library. Morgan Kaufmann, 2002.

[158] Antoni Wolski and Bela Hofhauser. A Self-Managing High-Availability Database: Industrial Case Study. In 21st International Conference on Data Engineering Workshops, 2005, page 1210, 2005.

[159] Marcia Lei Zeng, Jaesun Lee, and Allene F. Hayes. Metadata Decisions for Digital Libraries: A Survey Report. In Jung ran Park, editor, Metadata Best Practices and Guidelines: Current Implementation and Future Trends, pages 173--193. Routledge, London, 2012.

[160] Jin Zhao, Min-Yen Kan, and Yin Leng Theng. Math information retrieval: user requirements and prototype implementation. In Proceedings of the 8th annual Joint Conference on Digital Libraries, JCDL ’08, pages 187--196, New York, NY, USA, 2008. ACM.

[161] Zhao Zhao, Maleq Khan, V.S. Anil Kumar, and Madhav Marathe. Subgraph Enumeration in Large Social Contact Networks using Parallel Color Coding and Streaming. Technical Report 10-036, Network Dynamics and Simulation Science Laboratory, VBI, Virginia Tech, 2010.

[162] Qinwei Zhu, Marcos André Gonçalves, and Edward A. Fox. 5SGraph demo: a graphical modeling tool for digital libraries. In Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’03, pages 385--385, Washington, DC, USA, 2003. IEEE Computer Society.