Cyberinfrastructure Vision for 21st Century Discovery
National Science Foundation Cyberinfrastructure Council
March 2007
About the Cover
The visualization on the cover depicts a single cycle of delay measurements made by a CAIDA Internet performance monitor. The graph was created using the Walrus graph visualization tool, designed for interactively visualizing large directed graphs in 3-dimensional space. For more information: http://www.caida.org/tools/visualization/walrus/
Letter from the Director
Dr. Arden L. Bement, Jr., Director of the National Science Foundation
Dear Colleague:
I am pleased to present NSF's Cyberinfrastructure Vision for 21st Century Discovery. This document, developed in consultation with the wider science, engineering, and education communities, lays out an evolving vision that will help to guide the Foundation's future investments in cyberinfrastructure.
At the heart of the cyberinfrastructure vision is the development of a cultural community that supports peer-to-peer collaboration and new modes of education based upon broad and open access to leadership computing; data and information resources; online instruments and observatories; and visualization and collaboration services. Cyberinfrastructure enables distributed knowledge communities that collaborate and communicate across disciplines, distances and cultures. These research and education communities extend beyond traditional brick-and-mortar facilities, becoming virtual organizations that transcend geographic and institutional boundaries. This vision is new, exciting and bold.
Realizing the cyberinfrastructure vision described in this document will require the broad participation and collaboration of individuals from all fields and institutions, and across the entire spectrum of education. It will require leveraging resources through multiple and diverse partnerships among academia, industry and government. An important challenge is to develop the leadership to move the vision forward in anticipation of a comprehensive cyberinfrastructure that will strengthen innovation, economic growth and education.
Sincerely,
Arden L. Bement, Jr.
Director
Preface
The National Science Foundation's Cyberinfrastructure Council (CIC)¹, based on extensive input from the research community, has developed a comprehensive vision to guide the Foundation's future investments in cyberinfrastructure (CI). In 2005, four multi-disciplinary, cross-foundational teams were created and charged with drafting a vision for cyberinfrastructure in four overlapping and complementary areas: 1) High Performance Computing, 2) Data, Data Analysis, and Visualization, 3) Cyber Services and Virtual Organizations, and 4) Learning and Workforce Development. Draft versions of the document were posted on the NSF website and public comments were solicited from the community. These drafts were also reviewed for comment by the National Science Board. The National Science Foundation thanks all of those who provided feedback on the Cyberinfrastructure Vision for 21st Century Discovery document. Your comments were carefully reviewed and considered during preparation of this version of the document, which is intended to be a living document and will be updated periodically.
¹ A complete list of acronyms can be found in Appendix A.
Acknowledgements
We acknowledge the following NSF personnel who served on the strategic planning teams and whose efforts made this document possible. We especially acknowledge Deborah Crawford, who served as acting director for OCI from July 2005 to June 2006, and whose leadership was instrumental in the formulation of this document.
High Performance Computing (HPC) CI Team: Deborah Crawford (Chair), Leland Jameson, Margaret Leinen (CIC Representative), José Muñoz, Stephen Meacham, Michael Plesniak
Data CI Team: Cheryl Eavey, James French, Christopher Greer, David Lightfoot (CIC Representative), Elizabeth Lyons, Fillia Makedon, Daniel Newlon, Nigel Sharp, Sylvia Spengler (Chair)
Virtual Organizations (VO) CI Team: Thomas Baerwald, Elizabeth Blood, Charles Boudin, Arthur Goldstein, Joy Pauschke (Co-Chair), Randal Ruchti, Bonnie Thompson, Kevin Thompson (Co-Chair), Michael Turner (CIC Representative)
Learning and Workforce Development (LWD) CI Team: James Collins (CIC Representative), Janice Cuny, Semahat Demir, Lloyd Douglas, Debasish Dutta (Chair), Miriam Heller, Sally O'Connor, Michael Smith, Harold Stolberg, Lee Zia
Table of Contents
Letter from the Director
Preface
Acknowledgements
Executive Summary
1 Call to Action
  I. Cyberinfrastructure Drivers and Opportunities
  II. Vision, Mission and Principles for Cyberinfrastructure
  III. Goals and Strategies
  IV. Planning for Cyberinfrastructure
2 High Performance Computing (2006-2010)
  I. What Does High Performance Computing Offer Science and Engineering?
  II. The Next Five Years: Creating a High Performance Computing Environment for Petascale Science and Engineering
3 Data, Data Analysis, and Visualization (2006-2010)
  I. A Wealth of Scientific Opportunities Afforded by Digital Data
  II. Definitions
  III. Developing a Coherent Data Cyberinfrastructure in a Complex Global Context
  IV. The Next Five Years: Towards a National Digital Data Framework
4 Virtual Organizations for Distributed Communities (2006-2010)
  I. New Frontiers in Science and Engineering Through Networked Resources and Virtual Organizations
  II. The Next Five Years: Establishing a Flexible, Open Cyberinfrastructure Framework for Virtual Organizations
5 Learning and Workforce Development (2006-2010)
  I. Cyberinfrastructure and Learning
  II. Building Capacity for Creation and Use of Cyberinfrastructure
  III. Using Cyberinfrastructure to Enhance Learning
  IV. The Next Five Years: Learning About and With Cyberinfrastructure
Appendices
  A. Acronyms
  B. Representative Reports and Workshops
  C. Chronology of NSF Information Technology Investments
  D. Management of Cyberinfrastructure
  E. Representative Distributed Research Communities (Virtual Organizations)
Image Credits
Executive Summary
NSF's Cyberinfrastructure Vision for 21st Century Discovery is presented in a set of interrelated chapters that describe the various challenges and opportunities in the complementary areas that make up cyberinfrastructure: computing systems, data, information resources, networking, digitally enabled sensors, instruments, virtual organizations, and observatories, along with an interoperable suite of software services and tools. This technology is complemented by the interdisciplinary teams of professionals that are responsible for its development, deployment and use in transformative approaches to scientific and engineering discovery and learning. The vision also includes attention to the educational and workforce initiatives necessary for both the creation and effective use of cyberinfrastructure.
The five chapters of this document set out NSF's cyberinfrastructure vision. The first, Call to Action, presents NSF's vision and commitment to a cyberinfrastructure initiative. NSF will play a leadership role in the development and support of a comprehensive cyberinfrastructure essential to 21st century advances in science and engineering research and education. The vision focuses on a time frame of 2006-2010. The mission is for cyberinfrastructure to be human-centered, world-class, supportive of broadened participation in science and engineering, sustainable, and stable but extensible. The guiding principles are that investments will be science-driven, recognize the uniqueness of NSF's role, provide for inclusive strategic planning, enable U.S. leadership in science and engineering, promote partnerships and integration with investments made by others in all sectors, both national and international, and rely on strong merit review, on-going assessment, and a collaborative governance culture. This chapter goes on to review a set of more specific goals for NSF's cyberinfrastructure initiative, along with brief descriptions of the strategy to achieve each goal.
High Performance Computing (HPC) in support of modeling, simulation, and extraction of knowledge from huge data collections is increasingly essential to a broad range of scientific and engineering disciplines, often multi-disciplinary (e.g., physics, biology, medicine, chemistry, cosmology, computer science, mathematics), as well as multi-scalar in dimensions of space (e.g., nanometers to light-years), time (e.g., picoseconds¹ to billions of years), and complexity. A vision for petascale² science and engineering for the academic community, enabled by high performance computing, is presented along with a series of principles that would be used to guide NSF science-driven HPC investments. This would result in a sustained petascale-capable system deployed in the FY 2010 timeframe. The plan presented addresses HPC acquisition and deployment and various aspects of HPC software and tools, in addition to the necessary scalable applications that would execute on these HPC assets.
An effective computing environment designed to meet the computational needs of a range of science and engineering applications will include a variety of computing systems with complementary performance capabilities. NSF will invest in leadership class environments in the 0.5-10 petascale performance range. Strong partnerships involving other federal agencies, universities, industry and state government are also critical to success. NSF will also promote resource sharing between and among academic institutions to optimize the accessibility and use of HPC assets deployed and supported at the campus level. Supporting software services include the provision of intelligent development and problem-solving environments and tools. These tools are designed to provide improvements in ease of use, reusability of modules, and portable performance.
¹ A picosecond is 10^-12 second.
² Petascale is 10^15 operations per second with comparable storage and networking capacity.
The image shows computed charge density for iron oxide (FeO) within the local density approximation, with spherical ions subtracted. The colors represent the spin density, showing the antiferromagnetic ordering.
Researchers create cyberenvironments: secure, easy-to-use interfaces to instruments, data, computing systems, networks, applications, analysis and visualization tools, and services.
Data, Data Analysis, and Visualization are vital for progress in the increasingly data-intensive realm of science and engineering research and education. Any cogent plan addressing cyberinfrastructure must address the phenomenal growth of data in all its various dimensions. Scientists and engineers are producing, accessing, analyzing, integrating, storing and retrieving massive amounts of data daily. Further, this is a trend that is expected to see significant growth in the very near future as advances in sensors and sensor networks, high-throughput technologies and instrumentation, automated data acquisition, computational modeling and simulation, and other methods and technologies materialize. The anticipated growth in both the production and repurposing of digital data raises complex issues not only of scale and heterogeneity, but also of stewardship, curation and long-term access.
Responding to the challenges and opportunities of a data-intensive world, NSF will pursue a vision in which science and engineering digital data are routinely deposited in well-documented form, are regularly and easily consulted and analyzed by specialist and non-specialist alike, are openly accessible while suitably protected, and are reliably preserved. To realize this vision, NSF's goals for 2006-2010 are twofold: to catalyze the development of a system of science and engineering data collections that is open, extensible, and evolvable; and to support development of a new generation of tools and services for data discovery, integration, visualization, analysis and preservation. The resulting national digital data framework will be an integral component in the national cyberinfrastructure framework. It will consist of a range of data collections and managing organizations, networked together in a flexible technical architecture using standard, open protocols and interfaces, and designed to contribute to the emerging global information commons. It will be simultaneously local, regional, national and global in nature, and will evolve as science and engineering research and education needs change and as new science and engineering opportunities arise.
Virtual Organizations for Distributed Communities, built upon cyberinfrastructure, enable science and engineering communities to pursue their research and learning goals with dramatically relaxed constraints of time and distance. A virtual organization is created by a group of individuals whose members and resources may be dispersed geographically and/or temporally, yet who function as a coherent unit through the use of end-to-end cyberinfrastructure systems. These CI systems provide shared access to centralized or distributed resources and services, often in real-time. Such virtual organizations supporting distributed communities go by numerous names: collaboratory, co-laboratory, grid community, science gateway, science portal, and others. As such environments become more and more functionally complete, they offer new organizations for discovery and learning and bold new opportunities for broadened participation in science and engineering.
Creating and sustaining effective virtual organizations, especially those spanning many traditional organizations, is a complex technical and social challenge. It requires an open technological framework consisting of, for example, applications, tools, middleware, remote access to experimental facilities, instruments and sensors, as well as monitoring and post-analysis capabilities. An operational framework from campus level to international scale is required, as are partnerships between the various cyberinfrastructure stakeholders. Overall effectiveness also depends upon the appropriate social, governance, legal, economic and incentive structures. Formative and longitudinal evaluation is also necessary, both to inform iterative design and to develop understanding of the impact of virtual organizations on enhancing the effectiveness of discovery and learning.
Learning and Workforce Development opportunities and requirements recognize that the ubiquitous and interconnected nature of cyberinfrastructure will change not only how we teach but also how we learn. The future will see increasingly open access to online educational resources including courseware, knowledge repositories, laboratories, and collaboration tools. Collaboratories or science gateways (instances of virtual organizations) created by research communities will also offer participation in authentic inquiry-based learning. These new modes and opportunities to learn and to teach, covering K-12, post-secondary, the workforce and the general public, come with their own set of opportunities and challenges. New assessment techniques will have to be developed and understood; undergraduate curricula must be reinvented to fully exploit the capabilities made possible by cyberinfrastructure; and the education of the professionals that are being relied upon to support, develop and deploy future generations of cyberinfrastructure must be addressed. In addition, cyberinfrastructure will have an impact on how business will be conducted, and members of the workforce must have the capability to fully exploit the benefits afforded by these new technologies.
Cyberinfrastructure-enhanced discovery and learning is especially exciting because of the opportunities it affords for broadened participation and wider diversity along individual, geographical and institutional dimensions. To fully realize these opportunities, NSF will identify and address the barriers to utilization of cyberinfrastructure tools, services, and resources; promote the training of faculty, educators, students, researchers and the public; and encourage programs that will explore and exploit cyberinfrastructure, including taking advantage of the international connectivity it provides, which is particularly important as we prepare a globally engaged workforce.
Chapter 1: Call to Action
I. Cyberinfrastructure Drivers and Opportunities
How does a protein fold? What happens to space-time when two black holes collide? What impact does species gene flow have on an ecological community? What are the key factors that drive climate change? Did one of the trillions of collisions at the Large Hadron Collider produce a Higgs boson, the dark matter particle, or a black hole? Can we create an individualized model of each human being for personalized health care delivery? How does major technological change affect human behavior and structure complex social relationships? What answers will we find, to questions we have yet to ask, in the very large datasets that are being produced by telescopes, sensor networks, and other experimental facilities?
These questions, and many others, are only now coming within our ability to answer because of advances in computing and related information technology. Once used by a handful of elite researchers in a few research communities on select problems, advanced computing has become essential to future progress across the frontier of science and engineering. Coupled with continuing improvements in microprocessor speeds, converging advances in networking, software, visualization, data systems and collaboration platforms are changing the way research and education are accomplished.
Today's scientists and engineers need access to new information technology capabilities, such as distributed wired and wireless observing network complexes, and sophisticated simulation tools that permit exploration of phenomena that can never be observed or replicated by experiment. Computation offers new models of behavior and modes of scientific discovery that greatly extend the limited range of models that can be produced with mathematics alone, for example, chaotic behavior. Fewer and fewer researchers working at the frontiers of knowledge can carry out their work without cyberinfrastructure of one form or another.
While hardware performance has been growing exponentially (with gate density doubling every 18 months, storage capacity every 12 months, and network capability every 9 months), it has become clear that increasingly capable hardware is not the only requirement for computation-enabled discovery. Sophisticated software, visualization tools, middleware and scientific applications created and used by interdisciplinary teams are critical to turning flops, bytes and bits into scientific breakthroughs. In addition to these technical needs, the exploration of new organizational models and the creation of enabling policies, processes, and economic frameworks are also essential. The combined power of these capabilities and approaches is necessary to advance the frontiers of science and engineering, make seemingly intractable problems solvable, and pose profound new scientific questions.
The comprehensive infrastructure needed to capitalize on dramatic advances in information technology has been termed cyberinfrastructure (CI). Cyberinfrastructure integrates hardware for computing, data and networks, digitally-enabled sensors, observatories and experimental facilities, and an interoperable suite of software and middleware services and tools. Investments in interdisciplinary teams and cyberinfrastructure professionals with expertise in algorithm development, system operations, and applications development are also essential to exploit the full power of cyberinfrastructure to create, disseminate, and preserve scientific data, information and knowledge.
The Terashake 2.1 simulation depicts a velocity wavefield as it propagates through the 3D velocity structure beneath Southern California. Red and yellow colors indicate regions of compression, while blue and green colors show regions of dilation. Faint yellow (faults), red (roads), and blue (coastline) lines add geographical context.
For four decades, NSF has provided leadership in the scientific revolution made possible by information technology (Appendices B and C). Through investments ranging from supercomputing centers and the Internet to software and algorithm development, information technology has stimulated scientific breakthroughs across all science and engineering fields. Most recently, NSF's Information Technology Research (ITR) priority area sowed the seeds of broad and intensive collaboration among the computational, computer, and domain research communities that sets the stage for this Call to Action.
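As a rough illustration of the doubling periods quoted earlier in this section (gate density every 18 months, storage capacity every 12 months, network capability every 9 months), the following Python sketch computes the cumulative growth that each rate would imply. It assumes ideal, uninterrupted exponential growth, which is a simplification, and the function and variable names are hypothetical rather than taken from this document.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years`, assuming a fixed doubling period."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Doubling periods in months, as quoted in the text above.
DOUBLING_MONTHS = {
    "gate density": 18,
    "storage capacity": 12,
    "network capability": 9,
}


def growth_table(years: float) -> dict:
    """Growth factor for each hardware dimension over the same time span."""
    return {name: growth_factor(years, months)
            for name, months in DOUBLING_MONTHS.items()}


if __name__ == "__main__":
    for name, factor in growth_table(3).items():
        print(f"{name}: about {factor:.0f}x over 3 years")
```

Under these assumed rates, three years of compounding gives roughly 4x for gate density, 8x for storage capacity and 16x for network capability.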
NSF is the only agency within the U.S. government that funds research and education across all disciplines of science and engineering. Over the past five years, NSF has held community workshops, commissioned blue-ribbon panels, and carried out extensive internal planning (Appendix B). Thus, it is strategically placed to leverage, coordinate and transition cyberinfrastructure advances in one field to all fields of research.
Other federal agencies, the administration, Congress, the private sector, and other nations are aware of the growing importance of cyberinfrastructure to progress in science and engineering. Other federal agencies have planned improved capabilities for specific disciplines, and in some cases to address interdisciplinary challenges. Other countries have also been making significant progress in scientific cyberinfrastructure. Thus, the U.S. must engage in and actively benefit from cyberinfrastructure developments around the world.
Not only is the time ripe for a coordinated investment in cyberinfrastructure, but progress at the science and engineering frontiers depends on it. Our communities are in place and are poised to respond to such an investment.
Working with the science and engineering research and education communities and partnering with other key stakeholders, NSF is ready to lead.
II. Vision, Mission and Principles for Cyberinfrastructure
A. Vision
NSF will play a leadership role in the development and support of a comprehensive cyberinfrastructure essential to 21st century advances in science and engineering research and education.
B. Mission
NSF's mission for cyberinfrastructure (CI) is to:
Develop a human-centered CI that is driven by science and engineering research and education opportunities;
Provide the science and engineering communities with access to world-class CI tools and services, including those focused on: high performance computing; data, data analysis and visualization; networked resources and virtual organizations; and learning and workforce development;
Promote a CI that serves as an agent for broadening participation and strengthening the nation's workforce in all areas of science and engineering;
Provide a sustainable CI that is secure, efficient, reliable, accessible, usable, and interoperable, and that evolves as an essential national infrastructure for conducting science and engineering research and education; and
Create a stable but extensible CI environment that enables the research and education communities to contribute to the agency's statutory mission.
Visualization of a molecular dynamics simulation of a double-stranded DNA molecule as it enters a nanopore in a silicon nitride membrane.
C. Principles
The following principles will guide the agency's FY 2006 through FY 2010 investments:
Science and engineering research and education are foundational drivers of CI.
NSF has a unique leadership role in formulating and implementing a national CI agenda focused on advancing science and engineering.
Inclusive strategic planning is required to effectively address CI needs across a broad spectrum of organizations, institutions, communities and individuals, with input to the process provided through public comments, workshops, funded studies, advisory committees, merit review and open competitions.
Strategic investments in CI resources and services coupled with enabling policy and organizational framework are essential to continued U.S. leadership in science and engineering.
The integration and sharing of cyberinfrastructure assets deployed and supported at national, regional, local, community and campus levels represent the most effective way of constructing a comprehensive CI ecosystem suited to meeting future needs.
Public and private national and international partnerships that integrate CI users and providers and benefit NSF's research and education communities are also essential for enabling next-generation science and engineering.
Existing strengths, including research programs and CI facilities, serve as a foundation upon which to build a CI designed to meet the needs of the broad science and engineering community.
Merit review is essential for ensuring that the best ideas are pursued in all areas of CI funding.
Regular evaluation and assessment tailored to individual projects is essential for ensuring accountability to all stakeholders.
A collaborative CI governance and coordination structure that includes representatives who contribute to basic CI research, development and deployment, as well as those who use CI, is essential to ensure that CI is responsive to community needs and empowers research at the frontier.
III. Goals and Strategies
NSF's vision and mission statements on CI need well-defined goals and strategies to turn them into reality. The goals underlying these statements are provided below, with each goal followed by a brief description of the strategy to achieve it.
Across the CI landscape, NSF will:
Provide communities addressing the most computationally challenging problems with access to a world-class, high performance computing (HPC) environment through NSF acquisition and through exchange-of-service agreements with other entities, where possible.
NSF's investment strategy for the provision of CI resources and services will be linked to careful requirements analyses of the computational needs of research and education communities. NSF investments will be coordinated with those of other agencies in order to maximize access to these capabilities and to provide a range of representative high performance architectures.
Broaden access to state-of-the-art computing resources, focusing especially on institutions with less capability and communities where computational science is an emerging activity.
Robert Patterson demonstrates NCSA's 3D Visualization to Dr. Arden Bement, the Director of NSF, and others during the FY08 NSF budget roll-out.
Building on the achievements of current CI service providers and other NSF investments, the agency will work to make necessary computing resources more broadly available, paying particular attention to emerging and underserved communities.
Support the development and maintenance of robust systems software, programming tools, and applications needed to close the growing gap between peak performance and sustained performance on actual research codes, and to make the use of HPC systems, as well as novel architectures, easier and more accessible.
NSF will build on research in computer science and other research areas to provide science and engineering applications and problem-solving environments that more effectively exploit innovative architectures and large-scale computing systems. NSF will continue and build on its existing collaborations with other agencies in support of the development of HPC software and tools.
Support the continued development, expansion, hardening and maintenance of end-to-end software systems (user interfaces, workflow engines, science and engineering applications, data management, analysis and visualization tools, collaborative tools, and other software integrated into complete science and engineering systems via middleware) in order to bring the full power of a national cyberinfrastructure to communities of scientists and engineers.
NCSA's Cobalt computing system uses a 3D cylindrical configuration to model the sediment discharge of a river into the ocean and the initial stages of alluvial fan formation at the river's mouth.
Cyberinfrastructure will broaden access to state-of-the-art resources for learning and discovery, creating new opportunities for participation by emerging and underserved communities.
These investments will build on the software products of current and former programs, and will leverage work in core computer science research and development efforts supported by NSF and other federal agencies.
Support the development of the computing professionals, interdisciplinary teams, enabling policies and procedures, and new organizational structures such as virtual organizations, that are needed to achieve the scientific breakthroughs made possible by advanced CI, paying particular attention to opportunities to broaden the participation of underrepresented groups.
NSF will continue to improve its understanding of how participants in its research and education communities, as well as the scientific workforce, can use CI. For example, virtual organizations empower communities of users to interact, exchange information, and access and share resources through tailored interfaces. Some of NSF's investments will focus on appropriate mechanisms or structures for use, while others will focus on how best to train future users of CI. NSF will take advantage of the emerging communities associated with CI that provide unique and special opportunities for broadening participation in the science and engineering enterprise.
Support state-of-the-art innovation in data management and distribution systems, including digital libraries and educational environments that are expected to contribute to many of the scientific breakthroughs of the 21st century.
NSF will foster communication among forefront data management and distribution systems, digital libraries, and other education environments sponsored in its various directorates. NSF will ensure that its efforts take advantage of innovation in large data management and distribution activities sponsored by other agencies and through international efforts. These developments will play a critical role in decisions that NSF makes about stewardship of long-lived data.
The DANSE project at CalTech integrates new materials theory with high-performance computing, using data from facilities such as DOE's new Spallation Neutron Source in Oak Ridge, TN.
Support the design and development of the CI needed to realize the full scientific potential of NSF's investments in tools and large facilities, from observatories and accelerators to sensor networks and remote observing systems.
NSFs investments in large acilities and othertools require new types o CI such as wirelesscontrol o networks o sensors in hostile environ-ments, rapid distribution and analysis o petascaledata sets around the world, adaptive knowledge-based control and sampling systems, and innova-tive visualization systems or collaboration. NSFwill ensure that these projects invest appropriatelyin CI capabilities, promoting the integrated andwidespread use o the unique services provided bythese and other acilities. In addition, NSFs CIprograms will be designed to serve the needs othese projects.
Support the development and maintenance of the increasingly sophisticated applications needed to achieve the scientific goals of research and education communities.
The applications needed to produce cutting-edge science and engineering have become increasingly complex. They require teams, even communities, to develop and sustain wide and long-term applicability, and they leverage underlying software tools and increasingly common, persistent CI resources such as data repositories and authentication and authorization services. NSF's investments in applications will involve its directorates and offices that support domain-specific science and engineering. Special attention will be paid to the cross-disciplinary nature of much of the work.
Invest in the high-risk/high-gain basic research in computer science, computing and storage devices, mathematical algorithms, and the human/CI interfaces that are critical to powering the future exponential growth in all aspects of computing, including hardware speed, storage, connectivity and scientific productivity.
NSF's investments in operational CI must be coupled with vigorous research programs in the directorates to ensure that operational capabilities continue to expand and extend in the future. Important among these programs are activities to understand how humans adopt and use CI. NSF is especially well-placed to foster collaborations among computer scientists; social, behavioral and economic scientists; and other domain scientists and engineers to understand how humans can best use CI, in both research and education environments.
Provide a framework that will sustain reliable, stable resources and services while enabling the integration of new technologies and research developments with a minimum of disruption to users.
NSF will minimize disruption to users by realizing a comprehensive CI with an architecture and framework that emphasizes interoperability and open standards, thus providing flexibility for upgrades, enhancements and evolutionary changes. Pre-planned arrangements for alternative CI availabilities during competitions, changeovers and upgrades to production operations and services will be made, including cooperative arrangements with other agencies.
A strategy common to achieving all of these goals is partnering nationally and internationally, with other agencies, the private sector, and with universities to achieve a worldwide CI that is interoperable, flexible, efficient, evolving and broadly accessible. In particular, NSF will take a lead role in formulating and implementing a national CI strategy.
IV. Planning for Cyberinfrastructure
To implement its cyberinfrastructure vision, NSF will develop interdependent plans for each of the following aspects of CI, with emphasis on their integration to create a balanced science- and engineering-driven national CI:
High Performance Computing
Data, Data Analysis, and Visualization
Virtual Organizations for Distributed Communities, and
Learning and Workforce Development.
Others may be added at a later date.
While these aspects are addressed separately as a means for organizing this document, the central goal is the development of a fully-integrated CI framework comprised of the balanced, seamless blending of these components. This will require integrative management structures (such as the newly formed Office of Cyberinfrastructure, the NSF-wide Cyberinfrastructure Council, and the Cyberinfrastructure Coordinators Committee), as well as science-driven, community-based planning and implementation processes that span all the elements of a truly comprehensive CI framework.
Researchers upgrade the software of an automated weather station that transmits data to help track the iceberg's position in the Antarctic and reports on the microclimate of the ice surface.
These plans will be reviewed annually and will evolve over time, paced by the considerable rate of innovation in computing and communication, and by the growing needs of the science and engineering community for state-of-the-art CI capabilities. Through cycles of use-driven innovation, NSF's vision will become reality.
Chapter 2: High Performance Computing (2006-2010)
The visualization above, created from data generated by a tornado simulation calculated on the NCSA computing cluster, shows the tornado by spheres colored according to pressure. Orange and blue tubes represent the rising and falling airflow around the tornado.
NCAR's blueice supercomputer, shown on the opposite page, enables scientists to enhance the resolution and complexity of Earth system models, improve climate and weather research, and provide more accurate data to decision makers.
I. What Does High Performance Computing Offer Science and Engineering?
What are the three-dimensional structures of all of the proteins encoded by the human genome, and how does structure influence their function in a human cell? What patterns of emergent behavior occur in models of very large societies? How do massive stars explode and produce the heaviest elements in the periodic table? What sort of abrupt transitions can occur in Earth's climate and ecosystem structure? How do these transitions occur, and under what circumstances? If we could design catalysts atom-by-atom, could we transform industrial synthesis? What strategies might be developed to optimize management of complex infrastructure systems? What kind of language processing can occur in large assemblages of neurons? Can we enable integrated planning and response to natural and man-made disasters that prevent or minimize the loss of life and property? These are just some of the important questions that researchers wish to answer using contemporary tools in a state-of-the-art High Performance Computing (HPC) environment.
Using HPC-based applications, researchers study the properties of minerals at the extreme temperatures and pressures that occur deep within the Earth. They simulate the development of structure in the early Universe. They probe the structure of novel phases of matter such as the quark-gluon plasma. HPC capabilities enable the modeling of life cycles that capture interdependencies across diverse disciplines and multiple scales to create globally competitive manufacturing enterprise systems. And they examine the way proteins fold and vibrate after they are synthesized inside an organism. In fact, sophisticated numerical simulations permit scientists and engineers to perform a wide range of in silico experiments that would otherwise be too difficult, too expensive, or impossible to perform in the laboratory.
HPC systems and services are also essential to the success of research conducted with sophisticated experimental tools. Without the waveforms produced by the numerical simulation of black hole collisions and other astrophysical events, gravitational wave signals cannot be extracted from the data produced by the Laser Interferometer Gravitational Wave Observatory. High-resolution seismic inversions from the higher density of broad-band seismic observations furnished by the Earthscope project are necessary to determine shallow and deep Earth structure. Simultaneous integrated computational and experimental testing is conducted on the Network for Earthquake Engineering Simulation to improve seismic design of buildings and bridges. HPC is essential to extracting the signature of the Higgs boson and supersymmetric particles, two of the scientific drivers of the Large Hadron Collider, from the petabytes of data produced in the trillions of particle collisions.
Science and engineering research and education enabled by state-of-the-art HPC tools have a direct bearing on the nation's competitiveness. If investments in HPC are to have a long-term impact on problems of national need, such as bioengineering, critical infrastructure protection (for example, the electric power grid), health care, manufacturing, nanotechnology, energy, and transportation, then HPC tools must deliver high performance capability for a wide range of science and engineering applications.
A functioning ribosome, a complex of three large RNA molecules and fifty proteins with three million atoms, is simulated on the Texas Advanced Computing Center computer.
Results from the Parallel Climate Model, prepared from data in the Earth System Grid, depict wind vectors, surface pressure, sea surface temperature and sea ice concentration.
II. The Next Five Years: Creating a High Performance Computing Environment for Petascale Science and Engineering
NSF's five-year HPC goal is to enable petascale science and engineering through the deployment and support of a world-class HPC environment comprising the most capable combination of HPC assets available to the academic community. The petascale HPC environment will enable investigations of computationally challenging problems that require computers operating at sustained speeds on actual research codes of 10^15 floating point operations per second (petaflops) or that work with extremely large data sets on the order of 10^15 bytes (petabytes).
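To give a sense of the scales involved, the following minimal sketch converts the two petascale thresholds named above into wall-clock times. The workload size (10^21 operations) and the 10 GB/s read bandwidth are illustrative assumptions, not figures from this document, and the function names are hypothetical.

```python
# Illustrative arithmetic only. The workload size and I/O bandwidth used
# below are assumed values, not figures taken from this document.

PETAFLOPS = 1e15  # floating point operations per second (threshold in the text)
TERAFLOPS = 1e12  # for comparison with terascale systems
PETABYTE = 1e15   # bytes (threshold in the text)


def runtime_seconds(total_operations: float, sustained_flops: float) -> float:
    """Wall-clock seconds needed to finish a workload at a sustained rate."""
    return total_operations / sustained_flops


def scan_seconds(total_bytes: float, bytes_per_second: float) -> float:
    """Seconds needed to read a data set once at a given aggregate bandwidth."""
    return total_bytes / bytes_per_second


if __name__ == "__main__":
    workload = 1e21  # hypothetical simulation requiring 10^21 operations
    days_at_peta = runtime_seconds(workload, PETAFLOPS) / 86400
    years_at_tera = runtime_seconds(workload, TERAFLOPS) / (86400 * 365)
    print(f"{days_at_peta:.1f} days at a sustained petaflops")
    print(f"{years_at_tera:.1f} years at a sustained teraflops")

    # Hypothetical aggregate read bandwidth of 10 GB/s.
    hours_to_scan = scan_seconds(PETABYTE, 10e9) / 3600
    print(f"{hours_to_scan:.1f} hours to read one petabyte once")
```

Under these assumed numbers, the same workload that occupies a sustained-petaflops system for days would occupy a sustained-teraflops system for decades, which is the practical meaning of the thousand-fold step from terascale to petascale.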
Petascale HPC capabilities will permit researchers to perform simulations that are intrinsically multi-scale or that involve multiple simultaneous reactions, such as modeling the interplay among genes, microbes, and microbial communities and simulating the interactions among the ocean, atmosphere, cryosphere and biosphere in Earth systems models. In addition to addressing the most computationally challenging demands of science and engineering, new and improved HPC software services will make supercomputing platforms supported by NSF and other partner organizations more efficient, more accessible, and easier to use.
NSF will support the deployment of a well-engineered, scalable, HPC infrastructure designed to evolve as science and engineering research needs change. It will include a sufficient level of diversity, both in architecture and scale of deployed HPC systems, to realize the research and education goals of the broad science and engineering community. NSF's HPC investments will be complemented by its simultaneous investments in data analysis and visualization facilities essential to the effective transformation of data products into information and knowledge.
The following principles will guide the agency's FY 2006 through FY 2010 investments:
Science and engineering research and education priorities will drive HPC investments.
Collaborative activities involving science and engineering researchers and private sector organizations are needed to ensure that HPC systems and services are optimally configured to support petascale scientific computing.
Researchers and educators require access to reliable, robust, production-quality HPC resources and services.
HPC-related research and development advances generated in the public and private sectors, both domestic and foreign, must be leveraged to enrich HPC capabilities.
The development, implementation and annual update of an effective multi-year HPC strategy is crucial to the timely introduction of research and development outcomes and innovations in HPC systems, software and services.
NSF's implementation plan to create a petascale environment includes the following three interrelated components:
1). Specification, Acquisition, Deployment and Operation of Science-Driven HPC Systems Architectures
An effective computing environment designed to meet the computational needs of a range of science and engineering applications will include a variety of computing systems with complementary performance capabilities. By 2010, the petascale computing environment available to the academic science and engineering community is likely to consist of: (i) a significant number of systems with peak performance in the 50-500 teraflops range, deployed and supported at the local level by individual campuses and other research organizations; (ii) multiple systems with peak performance of 500+ teraflops that support the work of thousands of researchers nationally; and, (iii) at least one system capable of delivering sustained performance approaching 10^15 floating point operations per second on real applications that consume large amounts of memory, and/or that work with very large data sets (projects that demand the highest levels of computing performance). All NSF-deployed systems will be appropriately balanced and will include core computational hardware, local storage of sufficient capacity, and appropriate data analysis and visualization capabilities.
This numerical simulation, created on the NCSA Itanium Linux Cluster by international researchers, shows the merger of two black holes and the ripples in space-time that are born of the merger.
Over the FY 2006-2010 period, NSF will focus on HPC system acquisitions in the 100 teraflops to 10 petaflops range, where strategic investments on a national scale are necessary to ensure international leadership in science and engineering. Since different science and engineering codes may achieve optimal performance on different HPC architectures, it is likely that by 2010 the NSF-supported HPC environment will include both loosely coupled and tightly coupled systems, with several different memory models.
To address the challenge of providing the research community with access to a range of HPC architectures within a constrained budget, a key element of NSF's strategy is to participate in resource-sharing with other federal agencies. A strengthened interagency partnership will focus, to the extent practicable, on ensuring shared access to federal leadership-class resources with different architectures, and on the coordination of investments in HPC system acquisition and operation.
Massachusetts Institute of Technology researchers are developing computational tools to analyze the structure of any protein, such as the human ubiquitin hydrolase (shown), for knots.
The Department of Energy's Office of Science and National Nuclear Security Administration have very active programs in leadership computing. The Department of Defense's (DOD) High Performance Computing Modernization Office (HPCMOD) provides HPC resources and services for the DOD science and engineering community, while NASA is deploying significant computing systems that are also of interest to NSF PIs. NSF will explore enhanced coordination mechanisms with other appropriate federal agencies to capitalize on their common interests. It will seek opportunities to make coordinated and collaborative investments in science-driven hardware architectures in order to increase the diversity of architectures of leadership class systems available to researchers and educators around the country, to promote sharing of lessons learned, and to provide a richer HPC environment for the user communities supported by each agency.
Strong partnerships involving universities, industry and government are also critical to success. NSF will also promote resource sharing between and among academic institutions to optimize the accessibility and use of HPC assets deployed and supported at the campus level.
In addition to leveraging the promise of Phase III of the Defense Advanced Research Projects Agency (DARPA)-sponsored High Productivity Computing Systems (HPCS) program, the agency will establish a discussion and collaboration forum for scientists and engineers (including computational and computer scientists and engineers) and HPC system vendors, in order to ensure that HPC systems are optimally configured to support state-of-the-art scientific computing. On the one hand, these discussions will keep NSF and the academic community informed about new products, product roadmaps and technology challenges at various vendor organizations. On the other, they will provide HPC system vendors with insights into the major concerns and needs of the academic science and engineering community. These activities will lead to better alignment between applications and hardware both by influencing algorithm design and by influencing system integration.
2). Development and Maintenance of Supporting Software: New Design Tools, Performance Modeling Tools, Systems Software, and Fundamental Algorithms.
Many of the HPC software and service building blocks in scientific computing are common to a number of science and engineering applications. A supporting software and service infrastructure will accelerate the development of the scientific application codes needed to solve challenging scientific problems, and will help insulate these codes from the evolution of future generations of HPC hardware.
Supporting software services include the provision of intelligent development and problem-solving environments and tools. These tools are designed to provide improvements in ease of use, reusability of modules, and portable performance. Tools and services that take advantage of commonly-supported software tools can deliver similar work environments across different HPC platforms, greatly reducing the time-to-solution of computationally-intensive research problems by permitting local development of research codes that can then be rapidly transferred to, or incorporate services provided by, larger production
environments. These tools, and workflows built from collections of such tools, can also be packaged for more general use. Applications scientists and engineers will also benefit from the development of new tools and approaches to debugging, performance analysis, and performance optimization.
Specific applications depend on a broad class of numerical and non-numerical algorithms that are widely used by many applications, including linear algebra, fast spectral transforms, optimization algorithms, multi-grid methods, adaptive mesh refinement, symplectic integrators, and sorting and indexing routines. To date, improved or new algorithms have been important contributors to performance improvements in science and engineering applications, the development of multi-grid solvers for elliptic partial differential equations being a prime example. Innovations in algorithms will have a significant impact on the performance of applications software. The development of algorithms for different architectural environments is an essential component of the effort to develop portable, scalable, applications software. Other important software services include libraries for communications services, such as MPI and OpenMP.
The development and deployment of operating systems and compilers that scale to hundreds of thousands of processors are also necessary. They must provide effective fault-tolerance and effectively insulate users from parallelization, as well as provide protection from latency management and thread management issues. To test new developments at large scales, operating systems and kernel researchers and developers must have access to the infrastructure necessary to test their developments at scale.
The software provider community will be a source for: applied research and development of supporting technologies; harvesting promising supporting software technologies from the research communities; performing scalability/reliability tests to explore software viability; developing, hardening and maintaining software where necessary; and facilitating the transition of commercially viable software into the private sector. It is anticipated that this community will also support general software engineering consulting services for science and engineering applications, and will provide software engineering consulting support to individual researchers and research and education teams as necessary.
The software provider community will be expected to promote software interoperability among the various components of the cyberinfrastructure software stack, such as those generated to provide modeling and simulation data, data analysis and visualization services, and networked resources and virtual organization capabilities. (See Chapters 3 and 4 in this document.) This will be accomplished through the creation and utilization of appropriate software test harnesses and will ensure that sufficient configuration controls are in place to support the range of HPC platforms used by the research and education community. The applications community will identify needed improvements in supporting software and will provide input and feedback on the quality of services provided.
NSF will seek guidance on the evolution of software support from representatives of academia, federal agencies and private sector organizations, including third party and system vendors. They will provide input on the strengths, weaknesses, opportunities and gaps in the software services currently available to the science and engineering research and education communities.
To minimize duplication of effort and optimize the value of HPC services provided to the science and engineering community, NSF's investments will be coordinated with those of other agencies. DOE currently invests in software infrastructure centers through the Scientific Discovery through Advanced Computing (SciDAC) program, while DARPA's investments in the HPCS program contribute significant systems software and hardware innovations. NSF will seek to leverage and add value to ongoing DOE and DARPA efforts in this area.
Two skulls of separate species of pterosaurs were scanned at the High-Resolution X-ray Computed Tomography Facility at The University of Texas at Austin and the data was then fed to the DigiMorph digital library to produce 2-D and 3-D structural visualizations.
3). Development and Maintenance of Portable, Scalable Applications Software
Today's microprocessor-based terascale computers place considerable demands on our ability to manage parallelism, and to deliver large fractions of peak performance. As the agency seeks to create a petascale computing environment, it will embrace the challenge of developing or converting key application codes to run effectively on new and evolving system architectures.
Over the FY 2006 through 2010 period, NSF will make significant new investments in the development, hardening, enhancement and maintenance of scalable applications software, including community models, to exploit the full potential of current terascale and future petascale systems architectures. The creation of well-engineered, easy-to-use software will reduce the complexity and time-to-solution of today's challenging scientific applications. NSF will promote the incorporation of sound software engineering approaches in existing widely-used research codes and in the development of new research codes. Multidisciplinary teams of researchers will work together to create, modify and optimize applications for current and future systems using performance modeling tools and simulators.
Since the nature and genesis of science and engineering codes vary across the research landscape, a successful programmatic effort in this area will weave together several strands. A new activity will be designed to take applications that have the potential to be widely used within a community or communities, to harden these applications based on modern software engineering practices, to develop versions for the range of architectures that scientists wish to use them on, to optimize them for modern HPC architectures, and to provide user support.
Chapter 3: Data, Data Analysis, and Visualization (2006-2010)
An artist's conception (above) depicts fundamental NEON observatory instrumentation and systems as well as potential spatial organization of the environmental measurements made by these instruments and systems.
The image on the opposite page shows the action of the enzyme cellulase on cellulose using the CHARMM community code in a simulation carried out at SDSC. NREL will use the simulation to help develop strategies for efficient large-scale conversion of biomass into ethanol.
I. A Wealth of Scientific Opportunities Afforded by Digital Data
Science and engineering research and education have become increasingly data-intensive as a result of the proliferation of digital technologies, instrumentation, and pervasive networks through which data are collected, generated, shared and analyzed. Worldwide, scientists and engineers are producing, accessing, analyzing, integrating and storing terabytes of digital data daily through experimentation, observation and simulation. Moreover, the dynamic integration of data generated through observation and simulation is enabling the development of new scientific methods that adapt intelligently to evolving conditions to reveal new understanding. The enormous growth in the availability and utility of scientific data is increasing scholarly research productivity, accelerating the transformation of research outcomes into products and services, and enhancing the effectiveness of learning across the spectrum of human endeavor.
New scientific opportunities are emerging from increasingly effective data organization, access and usage. Together with the growing availability and capability of tools to mine, analyze and visualize data, the emerging data cyberinfrastructure is revealing new knowledge and fundamental insights. For example, analyses of DNA sequence data are providing remarkable insights into the origins of man, revolutionizing our understanding of the major kingdoms of life, and revealing stunning and previously unknown complexity in microbial communities. Sky surveys are changing our understanding of the earliest conditions of the universe and providing comprehensive views of phenomena ranging from black holes to supernovae. Researchers are monitoring socioeconomic dynamics over space and time to advance our understanding of individual and group behavior and its relationship to social, economic and political structures. Using combinatorial methods, scientists and engineers are generating libraries of new materials and compounds for health and engineering, and environmental scientists and engineers are acquiring and analyzing streaming data from massive sensor networks to understand the dynamics of complex ecosystems.
The National Virtual Observatory's Sky Statistics Survey allows astronomers to get a fast inventory of astronomical objects from various catalogs.
In this dynamic research and education environment, science and engineering data are constantly being collected, created, deposited, accessed, analyzed and expanded in the pursuit of new knowledge. In the future, U.S. international leadership in science and engineering will increasingly depend upon our ability to leverage this reservoir of scientific data captured in digital form, and to transform these data into information and knowledge aided by sophisticated data mining, integration, analysis and visualization tools.
This chapter sets forth a framework in which NSF will work with its partners in science and engineering (public and private sector organizations, both foreign and domestic, representing data producers, scientists, engineers, managers and users alike) to address data acquisition, access, usage, stewardship and management challenges in a comprehensive way.
II. Definitions
A. Data, Metadata and Ontologies
In this document, "data" and "digital data" are used interchangeably to refer to data and information stored in digital form and accessed electronically.
Data. For the purposes of this document, data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.
Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, interrelationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.
Ontology. An ontology is the systematic description of a given phenomenon. It often includes a controlled vocabulary and relationships, captures nuances in meaning and enables knowledge sharing and reuse.
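As a rough illustration of how these definitions relate, the sketch below models a metadata record with the facets listed above and an ontology as a controlled vocabulary plus relationships. Every class, field and function name here is hypothetical and chosen for this example. Matching records by shared vocabulary terms is one simple assumption about how metadata might "enable the identification of similar data"; the document does not prescribe a mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Data about data, using the facets named in the definition above."""
    content: str
    context: str
    structure: str
    interrelationships: list[str] = field(default_factory=list)
    provenance: list[str] = field(default_factory=list)  # history and origins
    terms: set[str] = field(default_factory=set)  # descriptive terms (assumed field)


@dataclass
class Ontology:
    """A controlled vocabulary plus relationships for one phenomenon."""
    phenomenon: str
    vocabulary: set[str]
    # Assumed representation: each term maps to one broader term.
    relationships: dict[str, str] = field(default_factory=dict)

    def accepts(self, record: Metadata) -> bool:
        """True if every term on the record is in the controlled vocabulary."""
        return record.terms <= self.vocabulary


def shared_terms(a: Metadata, b: Metadata) -> set[str]:
    """Terms two records have in common, a crude similarity signal."""
    return a.terms & b.terms
```

A record whose terms all come from a shared ontology can be compared with records held in a different collection, which is the sense in which a controlled vocabulary supports knowledge sharing and reuse.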
B. Data Collections
This document adopts the definition of data collection types provided in the NSB report on Long-Lived Digital Data Collections, where data collections are characterized as being one of three functional types:
Research Collections. Authors are individual investigators and investigator teams. Research collections are usually maintained to serve immediate group participants only for the life of a project, and are typically subjected to limited processing or curation. Data may not conform to any data standards.
Resource Collections. Resource collections are authored by a community of investigators, often within a domain of science or engineering, and are often developed with community-level standards. Budgets are often intermediate in size. Lifetime is between the mid- and long-term.
Reference Collections. Reference collections are authored by and serve large segments of the science and engineering community and conform to robust, well-established and comprehensive standards, which often lead to a universal standard. Budgets are large and are often derived from diverse sources with a view to indefinite support.
The GLORIAD network, an optical network ring around the northern hemisphere, promotes new opportunities for cooperation and understanding for scientists, educators and students.
Boundaries between the types are not rigid, and collections originally established as research collections may evolve over time to become resource and/or reference collections. In this document, the term "data collection" is construed to include one or more databases and their relevant technological implementation. Data collections are managed by organizations and individuals with the necessary expertise to structure them and to support their effective use.
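The three functional types and the evolution described above can be sketched as an ordered enumeration. The names below are hypothetical, and the strict ordering is an assumption made for illustration: the text states only that research collections may become resource and/or reference collections, and that the boundaries are not rigid.

```python
from enum import IntEnum


class CollectionType(IntEnum):
    """The three functional types, ordered by scope (ordering is assumed)."""
    RESEARCH = 1   # individual investigators and teams, life of a project
    RESOURCE = 2   # community of investigators, community-level standards
    REFERENCE = 3  # large community segments, comprehensive standards


def may_evolve(current: CollectionType, target: CollectionType) -> bool:
    """True if a collection could plausibly grow from current into target."""
    # Assumption: evolution only moves toward broader scope.
    return target > current


if __name__ == "__main__":
    print(may_evolve(CollectionType.RESEARCH, CollectionType.REFERENCE))
    print(may_evolve(CollectionType.REFERENCE, CollectionType.RESEARCH))
```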
III. Developing a Coherent Data Cyberinfrastructure in a Complex Global Context
Since data and data collections are owned ormanaged by a wide range o communities, orga-nizations and individuals around the world, NSFmust work in an evolving environment constantlybeing shaped by developing international andnational policies and treaties, community-spe-
cic policies and approaches, institutional-levelprograms and initiatives, individual practices, andcontinually advancing technological capabilities.
At the international level, a number o nationsand international organizations have already recog-nized the broad societal, economic, and scienticbenets that result rom open access to scienceand engineering digital data. In 2004, more than30 nations, including the United States, declaredtheir joint commitment to work toward the es-tablishment o common access regimes or digitalresearch data generated through public unding.Since the international exchange o scientic data,inormation and knowledge promises to signi-cantly increase the scope and scale o researchand its corresponding impact, these nations areworking together to dene the implementationsteps necessary to enable the global science andengineering system.
The U.S. community is engaged through the Committee on Data for Science and Technology (CODATA). The U.S. National Committee for CODATA (USNC/CODATA) is working with international CODATA partners, including the International Council for Science (ICSU), the International Council for Scientific and Technical Information (ICSTI), the World Data Centers (WDCs) and others, to accelerate the development of a global open-access scientific data and information resource, through the construction of an online open access knowledge environment, as well as through targeted projects. The Global Information Commons for Science is a multi-stakeholder initiative arising out of the second phase of the World Summit on the Information Society that can provide important opportunities for international coordination and cooperation. The goals of this initiative include improving understanding of the benefits of access to scientific data and information, promoting successful institutional and legal models for providing sustainable access, and enhancing coordination among the many science and engineering stakeholders around the world.
A number of international science and engineering communities have also been developing data management and curation approaches for reference and resource collections. For example, the international Consultative Committee for Space Data Systems (CCSDS) defined an archive reference model and service categories for the intermediate and long-term storage of digital data relevant to space missions. This effort produced the Open Archival Information System (OAIS), now adopted as the de facto standard for building digital archives, and provided evidence that a community-focused activity can have much broader impact than originally intended. In another example, the Inter-University Consortium for Political and Social Research (ICPSR) - a membership-based organization with over 500 member colleges and universities around the world - maintains and provides access to a vast archive of social science data. ICPSR serves as a content management organization, preserving relevant social science data and migrating them to new storage media as technology changes, and also provides user support services. ICPSR recently announced plans to establish an international standard for social science documentation. Similar activities in other communities are also underway. Clearly, NSF must maintain a presence in, support, and add value to these ongoing international discussions and activities.
Images produced by Montage on SDSC TeraGrid from the 2MASS all-sky survey provide astronomers with new insights into the large-scale structure of the Milky Way.
Activities on an international scale are complemented by activities within nation states. In the United States, a number of organizations and communities of practice are exploring mechanisms to establish common approaches to digital data access, management and curation. For example, the Research Library Group (RLG - a not-for-profit membership organization representing libraries, archives and museums) and the U.S. National Archives and Records Administration (NARA - a sister agency whose mission is to provide direction and assistance to federal agencies on records management) are producing certification requirements for establishing and selecting reliable digital information repositories. RLG and NARA intend their results to be standardized via the International Organization for Standardization (ISO) Archiving Series, and these results may affect all data collection types. The National Institutes of Health (NIH) National Center for Biotechnology Information plays an important role in the management of genome data at the national level, supporting public databases, developing software tools for analyzing data, and disseminating biomedical information.
At the institutional level, colleges and universities are developing approaches to digital data archiving, curation and analysis. They are sharing best practices to develop digital libraries that collect, preserve, index and share research and education material produced by faculty and other individuals within their organizations. The technological implementations of these systems are often open-source and support interoperability among their adopters. University-based research libraries and research librarians are positioned to make significant contributions in this area, where standard mechanisms for access and maintenance of scientific digital data may be derived from existing library standards developed for print material. These efforts are particularly important to NSF as the agency considers the implications of not only making all data generated with NSF funding broadly accessible, but of also promoting the responsible organization and management of these data so that they are widely usable.
IV. The Next Five Years: Towards a National Digital Data Framework
Motivated by a vision in which science and engineering digital data are routinely deposited in well-documented form, are regularly and easily consulted and analyzed by specialists and non-specialists alike, are openly accessible while suitably protected, and are reliably preserved, NSF's five-year goal is twofold:
To catalyze the development of a system of science and engineering data collections that is open, extensible and evolvable; and
To support development of a new generation of tools and services facilitating data mining, integration, analysis, and visualization essential to turning data into new knowledge and understanding.
The IRIS Seismic Monitor System allows scientists and others to monitor global earthquakes in near real-time, visit seismic stations world-wide, and search the web for earthquake information.
The resulting national digital data framework will be an integral component in the national cyberinfrastructure framework described in this document. It will consist of a range of data collections and managing organizations, networked together in a flexible technical architecture using standard, open protocols and interfaces, and designed to contribute to the emerging global information commons. It will be simultaneously local, regional, national and global in nature, and will evolve as science and engineering research and education needs change and as new science and engineering opportunities arise. Widely accessible tools and services will permit scientists and engineers to access and manipulate these data to advance the science and engineering frontier.
In print form, the preservation process is handled through a system of libraries and other repositories throughout the country and around the globe. Two features of this print-based system make it robust. First, the diversity of business models deriving support from a variety of sources means that no single entity bears sole responsibility for preservation, and the system is resilient to changes in any particular sector. Second, there is overlap in the collections, and redundancy of content reduces the potential for catastrophic loss of information.
The national data framework is envisioned to provide an equally robust and diverse system for digital data management and access. It will promote interoperability between data collections supported and managed by a range of organizations and organization types; provide for appropriate protection and reliable long-term preservation of digital data; deliver computational performance, data reliability and movement through shared tools, technologies and services; and accommodate individual community preferences. NSF will also develop a suite of coherent data policies that emphasize open access and effective organization and management of digital data, while respecting the data needs and requirements within science and engineering domains.
The following principles will guide the agency's FY 2006 through FY 2010 investments:
Science and engineering research and education opportunities and priorities will motivate NSF investments in data cyberinfrastructure.
Science and engineering data generated with NSF funding will be readily accessible and easily usable, and will be appropriately, responsibly and reliably preserved.
Broad community engagement is essential to the prioritization and evaluation of the utility of scientific data collections, including the possible evolution from research to resource and reference collection types.
Continual exploitation of data in the creation of new knowledge requires that investigators have access to the tools and services necessary to locate and access relevant data, and understand its structure sufficiently to be able to interpret and (re)analyze what they find.
The establishment of strong, reciprocal, international, interagency and public-private partnerships is essential to ensure all stakeholders are engaged in the stewardship of valuable data assets. Transition plans, addressing issues such as media, stewardship and standards, will be developed for valuable data assets, to protect data and assure minimal disruption to the community during transition periods.
Mechanisms will be created to share data stewardship best practices between nations, communities, organizations and individuals.
In light of legal, ethical and national security concerns associated with certain types of data, mechanisms essential to the development of both statistical and technical ways to protect privacy and confidentiality will be supported.
Researchers check functionality and performance of the Compact Muon Solenoid detector at CERN before its closure. Built on the Large Hadron Collider, it provides a magnetic field of 4 T.
A. A Coherent Organizational Framework - Data Collections and Managing Organizations
To date, challenges associated with effective stewardship and preservation of scientific data have been more tractable when addressed through communities of practice that may derive support from a range of sources. For example, NSF supports the Incorporated Research Institutions for Seismology (IRIS) consortium to manage seismology data. Jointly with NIH and DOE, the agency supports the Protein Data Bank to manage data on the three-dimensional structures of proteins and nucleic acids. Multiple agencies support the University Corporation for Atmospheric Research, an organization that has provided access to atmospheric and oceanographic data sets, simulations, and outcomes extending back to the 1930s through the National Center for Atmospheric Research.
Existing collections and managing organization models reflect differences in culture and practice within the science and engineering community. As community proxies, data collections and their managing organizations can provide a focus for the development and dissemination of appropriate standards for data and metadata content and format, guided by an appropriate community-defined governance approach. This is not a static process, as new disciplinary fields and approaches, data types, organizational models and information strategies inexorably emerge. This is discussed in detail in the Long-Lived Digital Data Collections report of the National Science Board.
Since data are held by many federal agencies, commercial and non-profit organizations, and international entities, NSF will foster the establishment of interagency, public-private and international consortia charged with providing stewardship for digital data collections to promote interoperability across data collections. The agency will work with the broad community of science and engineering producers, managers, scientists and users to develop a common conceptual framework. A full range of mechanisms will be used to identify and build upon common ground across domain communities and managing organizations, engaging all stakeholders. Activities will include: the support of new projects; development and implementation of evaluation and assessment criteria that, among other things, reveal lessons learned across communities; support of community and intercommunity workshops; and the development of strong partnerships with other stakeholder organizations. Stakeholders in these activities include data authors, data managers, data scientists and engineers, and data users representing a diverse range of communities and organizations, including universities and research libraries, government agencies, content management organizations and data centers, and industry.
A simulated event of the collision of two protons in the ATLAS experiment. The colors of the tracks emanating from the center show the different types of particles emerging from the collision.
To identify and promote lessons learned across managing organizations, NSF will continue to promote the coalescence of appropriate collections with overlapping interests, approaches and services. This reduces data-driven fragmentation of science and engineering domains. Progress is already being made in some areas. For example, NSF has been working with the environmental science and engineering community to promote collaboration across disciplines ranging from ecology and hydrology to environmental engineering. This has resulted in the emergence of common cyberinfrastructure elements and new interdisciplinary science and engineering opportunities.
B. Developing a Flexible Technological Architecture
From a technological perspective, the national data framework must provide for reliable preservation, access, analysis, interoperability, and data movement, possibly using a web or grid services distributed environment. The architecture must use standard open protocols and interfaces to enable the broadest use by multiple communities. It must facilitate user access, analysis and visualization of data, addressing issues such as authentication, authorization and other security concerns, and data acquisition, mining, integration, analysis and visualization. It must also support complex workflows enabling data discovery. Such an architecture can be visualized as a number of layers providing different capabilities to the user, including data management, analysis, collaboration tools, and community portals. The connections among these layers must be transparent to the end user, and services must be available as modular units responsive to individual or community needs. The system is likely to be implemented as a series of distributed applications and operations supported by a number of organizations and institutions distributed throughout the country. It must provide for the replication of data resources to reduce the potential for catastrophic loss of digital information through repeated cycles of systems migration and all other causes since, unlike printed records, the media on which digital data are stored and the structures of the data are relatively fragile.
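The replication requirement described above can be illustrated with a minimal Python sketch. It is not part of the NSF framework: the function names (`sha256_of`, `replicate`, `damaged_replicas`) and the use of local directories as stand-ins for independent storage sites are assumptions made for illustration only. The sketch copies a data file to several locations and uses a checksum to detect replicas that have been lost or altered.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, replica_dirs: list[Path]) -> list[Path]:
    """Copy source into every replica directory and verify each copy.

    The directories are hypothetical stand-ins for independent sites.
    """
    expected = sha256_of(source)
    copies = []
    for directory in replica_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"replica {target} does not match the source")
        copies.append(target)
    return copies


def damaged_replicas(copies: list[Path], expected: str) -> list[Path]:
    """Return replicas that are missing or whose checksum no longer matches."""
    return [c for c in copies if not c.exists() or sha256_of(c) != expected]
```

Recording the checksum at deposit time and re-running a check such as `damaged_replicas` after each systems migration is one simple way to notice a damaged copy while intact replicas still exist elsewhere.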
High quality metadata, which summarize data content, context, structure, interrelationships, and provenance (information on history and origins), are critical to successful information management, annotation, integration and analysis. Metadata take on an increasingly important role when addressing issues associated with the combination of data from experiments, observations and simulations. In these cases, product data sets require metadata that describe, for example, relevant collection techniques, simulation codes or pointers to archived copies of simulation codes, and codes used to process, aggregate or transform data. These metadata are essential to create new knowledge and to meet the reproducibility imperative of modern science. Metadata are often associated with data via markup languages, representing a consensus around a controlled vocabulary to describe phenomena of interest to the community, and allowing detailed annotations of data to be embedded within a data set. Because there is often little awareness of markup language development activities within science and engineering communities, effort is expended reinventing what could be adopted or adapted from elsewhere. Scientists and engineers therefore need access to tools and services that help ensure that metadata are automatically captured or created in real-time.
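As an illustration of metadata carried in a markup language, the following Python sketch builds a small XML record covering content, context and provenance. The element and attribute names (`dataset`, `context`, `provenance`, `step`) and the sample values are hypothetical; they are not drawn from any community standard, and a real collection would adopt an existing controlled vocabulary instead of inventing one.

```python
import xml.etree.ElementTree as ET


def build_metadata(title, collected_by, method, processing_steps):
    """Build a small XML metadata record; all element names are hypothetical."""
    record = ET.Element("dataset")
    ET.SubElement(record, "title").text = title
    # Context: who collected the data and how.
    context = ET.SubElement(record, "context")
    ET.SubElement(context, "collectedBy").text = collected_by
    ET.SubElement(context, "collectionMethod").text = method
    # Provenance: the ordered codes used to process or transform the data.
    provenance = ET.SubElement(record, "provenance")
    for order, (code, version) in enumerate(processing_steps, start=1):
        ET.SubElement(provenance, "step", order=str(order), code=code, version=version)
    return record


record = build_metadata(
    title="Example stream gauge readings",
    collected_by="Hypothetical field station",
    method="pressure transducer, 15-minute interval",
    processing_steps=[("despike.py", "1.2"), ("daily_mean.py", "0.9")],
)
xml_text = ET.tostring(record, encoding="unicode")
```

Generating such a record inside the processing code itself, at the moment each step runs, is one way to approach the automatic capture of metadata described above.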
Effective data analysis tools apply computational techniques to extract new knowledge through a better understanding of the data and its redundancies and relationships by filtering extraneous information and by revealing previously unseen patterns. For example, the Large Hadron Collider at CERN generates such massive data sets that the detection of both expected events, such as the Higgs boson, and unexpected phenomena requires the development of new algorithms, both to manage data and to analyze it. Algorithms and their implementations must be developed for statistical sampling, for visualization, to enable the storage, movement and preservation of enormous quantities of data, and to address other unforeseen problems certain to arise.
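One long-established example of statistical sampling for data sets too large to hold in memory is reservoir sampling. The Python sketch below is offered only as an illustration of the kind of algorithm meant here; it is not a method described in this document or used by any particular experiment, and the function name and parameters are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Only k items are held in memory, so the stream can be arbitrarily long.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=10, seed=42)
```

Because the stream is read once and never stored, the same approach applies to data arriving from an instrument in real time.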
Scientific visualization, including not just static images but also animation and interaction, leads to better analysis and enhanced understanding. Currently, many visualization systems are domain- or application-specific and require a certain commitment to understanding or learning to use them. Making visualization services more transparent to the user lowers the threshold of usability and accessibility, and makes it possible for a wider range of users to explore a data collection. Analysis of data streams also introduces problems in data visualization and may require new approaches for representing massive, heterogeneous data streams.
Deriving knowledge from large data sets presents specific scaling problems due to the sheer number of items, dimensions, sources, users, and disparate user communities. The human ability to process visual information can augment analysis, especially when analytic results are presented in iterative and interactive ways. Visual analytics, the science of analytical reasoning enabled by interactive visual interfaces, can be used to synthesize the information content and derive insight from massive, dynamic, ambiguous, and even conflicting data. Suitable fully interactive visualizations help absorb vast amounts of data directly, to enhance one's ability to interpret and analyze otherwise overwhelming data. Researchers can thus detect the expected and discover the unexpected, uncovering hidden associations and deriving knowledge from information. As an added benefit, their insights are more easily and effectively communicated to others.
Creating and deploying visualization services requires new frameworks for distributed applications. In common with other cyberinfrastructure components, visualization requires easy-to-use, modular, extensible applications that capitalize on the reuse of existing technology. Today's successful analysis and visualization applications use a pipeline, component-based system on a single machine or across a small number of machines. Extending to the broader distributed, heterogeneous cyberinfrastructure system will require new interfaces and work in fundamental graphics and visualization algorithms that can be run across remote and distributed settings.
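The pipeline, component-based style mentioned above can be sketched in a few lines of Python. The stage names (`read_values`, `drop_outliers`, `to_bins`) and the sample readings are hypothetical, and the final stage only counts values per bin as a stand-in for a real rendering component. The point of the sketch is that each stage consumes the output of the previous one and can be replaced independently.

```python
def read_values(records):
    """Source stage: yield numeric readings from raw text records."""
    for record in records:
        yield float(record)


def drop_outliers(values, low, high):
    """Filter stage: pass through only values inside [low, high]."""
    for value in values:
        if low <= value <= high:
            yield value


def to_bins(values, width):
    """Final stage (stand-in for rendering): count values per histogram bin."""
    counts = {}
    for value in values:
        bin_start = int(value // width) * width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return counts


raw = ["1.5", "2.5", "99", "3.1"]
histogram = to_bins(drop_outliers(read_values(raw), low=0, high=10), width=2)
```

On a single machine the stages are chained through generators. Running the same stages across remote and distributed settings would require network interfaces between them, which is the extension the text identifies as needing new work.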
To address this range of needs for data tools and services, NSF will work with the broad community to identify and prioritize needs. In making investments, NSF will complement private sector efforts, for example, those producing sophisticated indexing and search tools and packaging them as data services. NSF will support projects to conduct applied research and development of promising, interoperable