
Cyberinfrastructure Vision for 21st Century Discovery

National Science Foundation
Cyberinfrastructure Council

March 2007


About the Cover

The visualization on the cover depicts a single cycle of delay measurements made by a CAIDA Internet performance monitor. The graph was created using the Walrus graph visualization tool, designed for interactively visualizing large directed graphs in 3-dimensional space. For more information: http://www.caida.org/tools/visualization/walrus/


Letter from the Director

Dr. Arden L. Bement, Jr., Director of the National Science Foundation

Dear Colleague:

I am pleased to present NSF's Cyberinfrastructure Vision for 21st Century Discovery. This document, developed in consultation with the wider science, engineering, and education communities, lays out an evolving vision that will help to guide the Foundation's future investments in cyberinfrastructure.

At the heart of the cyberinfrastructure vision is the development of a cultural community that supports peer-to-peer collaboration and new modes of education based upon broad and open access to leadership computing; data and information resources; online instruments and observatories; and visualization and collaboration services. Cyberinfrastructure enables distributed knowledge communities that collaborate and communicate across disciplines, distances and cultures. These research and education communities extend beyond traditional brick-and-mortar facilities, becoming virtual organizations that transcend geographic and institutional boundaries. This vision is new, exciting and bold.

Realizing the cyberinfrastructure vision described in this document will require the broad participation and collaboration of individuals from all fields and institutions, and across the entire spectrum of education. It will require leveraging resources through multiple and diverse partnerships among academia, industry and government. An important challenge is to develop the leadership to move the vision forward in anticipation of a comprehensive cyberinfrastructure that will strengthen innovation, economic growth and education.

    Sincerely,

Arden L. Bement, Jr.
Director



Preface

The National Science Foundation's Cyberinfrastructure Council (CIC)1, based on extensive input from the research community, has developed a comprehensive vision to guide the Foundation's future investments in cyberinfrastructure (CI). In 2005, four multi-disciplinary, cross-foundational teams were created and charged with drafting a vision for cyberinfrastructure in four overlapping and complementary areas: 1) High Performance Computing, 2) Data, Data Analysis, and Visualization, 3) Cyber Services and Virtual Organizations, and 4) Learning and Workforce Development. Draft versions of the document were posted on the NSF website and public comments were solicited from the community. These drafts were also reviewed for comment by the National Science Board. The National Science Foundation thanks all of those who provided feedback on the Cyberinfrastructure Vision for 21st Century Discovery document. Your comments were carefully reviewed and considered during preparation of this version of the document, which is intended to be a living document, and will be updated periodically.

Acknowledgements

We acknowledge the following NSF personnel who served on the strategic planning teams and whose efforts made this document possible. We especially acknowledge Deborah Crawford, who served as acting director for OCI from July 2005 to June 2006, and whose leadership was instrumental in the formulation of this document.

High Performance Computing (HPC) CI Team: Deborah Crawford (Chair), Leland Jameson, Margaret Leinen (CIC Representative), José Muñoz, Stephen Meacham, Michael Plesniak

Data CI Team: Cheryl Eavey, James French, Christopher Greer, David Lightfoot (CIC Representative), Elizabeth Lyons, Fillia Makedon, Daniel Newlon, Nigel Sharp, Sylvia Spengler (Chair)

Virtual Organizations (VO) CI Team: Thomas Baerwald, Elizabeth Blood, Charles Boudin, Arthur Goldstein, Joy Pauschke (Co-Chair), Randal Ruchti, Bonnie Thompson, Kevin Thompson (Co-Chair), Michael Turner (CIC Representative)

Learning and Workforce Development (LWD) CI Team: James Collins (CIC Representative), Janice Cuny, Semahat Demir, Lloyd Douglas, Debasish Dutta (Chair), Miriam Heller, Sally O'Connor, Michael Smith, Harold Stolberg, Lee Zia

1 A complete list of acronyms can be found in Appendix A.



Table of Contents

Letter from the Director
Preface
Acknowledgements
Executive Summary

1. Call to Action
   I. Cyberinfrastructure Drivers and Opportunities
   II. Vision, Mission and Principles for Cyberinfrastructure
   III. Goals and Strategies
   IV. Planning for Cyberinfrastructure

2. High Performance Computing (2006-2010)
   I. What Does High Performance Computing Offer Science and Engineering?
   II. The Next Five Years: Creating a High Performance Computing Environment for Petascale Science and Engineering

3. Data, Data Analysis, and Visualization (2006-2010)
   I. A Wealth of Scientific Opportunities Afforded by Digital Data
   II. Definitions
   III. Developing a Coherent Data Cyberinfrastructure in a Complex Global Context
   IV. The Next Five Years: Towards a National Digital Data Framework

4. Virtual Organizations for Distributed Communities (2006-2010)
   I. New Frontiers in Science and Engineering Through Networked Resources and Virtual Organizations
   II. The Next Five Years: Establishing a Flexible, Open Cyberinfrastructure Framework for Virtual Organizations

5. Learning and Workforce Development (2006-2010)
   I. Cyberinfrastructure and Learning
   II. Building Capacity for Creation and Use of Cyberinfrastructure
   III. Using Cyberinfrastructure to Enhance Learning
   IV. The Next Five Years: Learning About and With Cyberinfrastructure

Appendices
   A. Acronyms
   B. Representative Reports and Workshops
   C. Chronology of NSF Information Technology Investments
   D. Management of Cyberinfrastructure
   E. Representative Distributed Research Communities (Virtual Organizations)

Image Credits


Executive Summary

NSF's Cyberinfrastructure Vision for 21st Century Discovery is presented in a set of interrelated chapters that describe the various challenges and opportunities in the complementary areas that make up cyberinfrastructure: computing systems, data, information resources, networking, digitally enabled sensors, instruments, virtual organizations, and observatories, along with an interoperable suite of software services and tools. This technology is complemented by the interdisciplinary teams of professionals that are responsible for its development, deployment and its use in transformative approaches to scientific and engineering discovery and learning. The vision also includes attention to the educational and workforce initiatives necessary for both the creation and effective use of cyberinfrastructure.

The five chapters of this document set out NSF's cyberinfrastructure vision. The first, A Call to Action, presents NSF's vision and commitment to a cyberinfrastructure initiative. NSF will play a leadership role in the development and support of a comprehensive cyberinfrastructure essential to 21st century advances in science and engineering research and education. The vision focuses on a time frame of 2006-2010. The mission is for cyberinfrastructure to be human-centered, world-class, supportive of broadened participation in science and engineering, sustainable, and stable but extensible. The guiding principles are that investments will be science-driven, recognize the uniqueness of NSF's role, provide for inclusive strategic planning, enable U.S. leadership in science and engineering, promote partnerships and integration with investments made by others in all sectors, both national and international, and rely on strong merit review, on-going assessment, and a collaborative governance culture. This chapter goes on to review a set of more specific goals and strategies for NSF's cyberinfrastructure initiative along with brief descriptions of the strategy to achieve those goals.

High Performance Computing (HPC) in support of modeling, simulation, and extraction of knowledge from huge data collections is increasingly essential to a broad range of scientific and engineering disciplines, often multi-disciplinary (e.g., physics, biology, medicine, chemistry, cosmology, computer science, mathematics), as well as multi-scalar in dimensions of space (e.g., nanometers to light-years), time (e.g., picoseconds1 to billions of years), and complexity. A vision for petascale2 science and engineering for the academic community, enabled by high performance computing, is presented along with a series of principles that would be used to guide NSF science-driven HPC investments. This would result in a sustained petascale-capable system deployed in the FY 2010 timeframe. The plan presented addresses HPC acquisition and deployment and various aspects of HPC software and tools, in addition to the necessary scalable applications that would execute on these HPC assets.

An effective computing environment designed to meet the computational needs of a range of science and engineering applications will include a variety of computing systems with complementary performance capabilities. NSF will invest in leadership-class environments in the 0.5-10 petascale performance range. Strong partnerships involving other federal agencies, universities, industry and state government are also critical to success. NSF will also promote resource sharing between and among academic institutions to optimize the accessibility and use of HPC assets deployed and supported at the campus level. Supporting software services include the provision of intelligent development and problem-solving environments and tools. These tools are designed to provide improvements in ease of use, reusability of modules, and portable performance.

1 A picosecond is 10^-12 second.
2 A petascale is 10^15 operations per second with comparable storage and networking capacity.

The image shows computed charge density for iron oxide (FeO) within the local density approximation, with spherical ions subtracted. The colors represent the spin density, showing the antiferromagnetic ordering.


Researchers create cyberenvironments: secure, easy-to-use interfaces to instruments, data, computing systems, networks, applications, analysis and visualization tools, and services.

Data, Data Analysis, and Visualization are vital for progress in the increasingly data-intensive realm of science and engineering research and education. Any cogent plan addressing cyberinfrastructure must address the phenomenal growth of data in all its various dimensions. Scientists and engineers are producing, accessing, analyzing, integrating, storing and retrieving massive amounts of data daily. Further, this is a trend that is expected to see significant growth in the very near future as advances in sensors and sensor networks, high-throughput technologies and instrumentation, automated data acquisition, computational modeling and simulation, and other methods and technologies materialize. The anticipated growth in both the production and repurposing of digital data raises complex issues not only of scale and heterogeneity, but also of stewardship, curation and long-term access.

Responding to the challenges and opportunities of a data-intensive world, NSF will pursue a vision in which science and engineering digital data are routinely deposited in well-documented form, are regularly and easily consulted and analyzed by specialist and non-specialist alike, are openly accessible while suitably protected, and are reliably preserved. To realize this vision, NSF's goals for 2006-2010 are twofold: to catalyze the development of a system of science and engineering data collections that is open, extensible, and evolvable; and to support development of a new generation of tools and services for data discovery, integration, visualization, analysis and preservation. The resulting national digital data framework will be an integral component in the national cyberinfrastructure framework. It will consist of a range of data collections and managing organizations, networked together in a flexible technical architecture using standard, open protocols and interfaces, and designed to contribute to the emerging global information commons. It will be simultaneously local, regional, national and global in nature, and will evolve as science and engineering research and education needs change and as new science and engineering opportunities arise.

Virtual Organizations for Distributed Communities, built upon cyberinfrastructure, enable science and engineering communities to pursue their research and learning goals with dramatically relaxed constraints of time and distance. A virtual organization is created by a group of individuals whose members and resources may be dispersed geographically and/or temporally, yet who function as a coherent unit through the use of end-to-end cyberinfrastructure systems. These CI systems provide shared access to centralized or distributed resources and services, often in real-time. Such virtual organizations supporting distributed communities go by numerous names: collaboratory, co-laboratory, grid community, science gateway, science portal, and others. As such environments become more and more functionally complete they offer new organizations for discovery and learning and bold new opportunities for broadened participation in science and engineering.

Creating and sustaining effective virtual organizations, especially those spanning many traditional organizations, is a complex technical and social challenge. It requires an open technological framework consisting of, for example, applications, tools, middleware, remote access to experimental facilities, instruments and sensors, as well as monitoring and post-analysis capabilities. An operational framework from campus level to international scale is required, as are partnerships between the various cyberinfrastructure stakeholders. Overall effectiveness also depends upon the appropriate social, governance, legal, economic and incentive structures. Formative and longitudinal evaluation is also necessary, both to inform iterative design and to develop understanding of the impact of virtual organizations on enhancing the effectiveness of discovery and learning.

Learning and Workforce Development opportunities and requirements recognize that the ubiquitous and interconnected nature of cyberinfrastructure will change not only how we teach but also how we learn. The future will see increasingly open access to online educational resources including courseware, knowledge repositories, laboratories, and collaboration tools. Collaboratories or science gateways (instances of virtual organizations) created by research communities will also offer participation in authentic inquiry-based learning. These new modes and opportunities to learn and to teach, covering K-12, post-secondary, the workforce and the general public, come with their own set of opportunities and challenges. New assessment techniques will have to be developed and understood; undergraduate curricula must be reinvented to fully exploit the capabilities made possible by cyberinfrastructure; and the education of the professionals that are being relied upon to support, develop and deploy future generations of cyberinfrastructure must be addressed. In addition, cyberinfrastructure will have an impact on how business will be conducted, and members of the workforce must have the capability to fully exploit the benefits afforded by these new technologies.

Cyberinfrastructure-enhanced discovery and learning is especially exciting because of the opportunities it affords for broadened participation and wider diversity along individual, geographical and institutional dimensions. To fully realize these opportunities NSF will identify and address the barriers to utilization of cyberinfrastructure tools, services, and resources; promote the training of faculty, educators, students, researchers and the public; and encourage programs that will explore and exploit cyberinfrastructure, including taking advantage of the international connectivity it provides, which is particularly important as we prepare a globally engaged workforce.


Chapter 1: Call to Action

The TeraShake 2.1 simulation on the opposite page depicts a velocity wavefield as it propagates through the 3D velocity structure beneath Southern California. Red and yellow colors indicate regions of compression, while blue and green colors show regions of dilation. Faint yellow (faults), red (roads), and blue (coastline) lines add geographical context.

I. Cyberinfrastructure Drivers and Opportunities

How does a protein fold? What happens to space-time when two black holes collide? What impact does species gene flow have on an ecological community? What are the key factors that drive climate change? Did one of the trillions of collisions at the Large Hadron Collider produce a Higgs boson, the dark matter particle, or a black hole? Can we create an individualized model of each human being for personalized health care delivery? How does major technological change affect human behavior and structure complex social relationships? What answers will we find to questions we have yet to ask in the very large datasets that are being produced by telescopes, sensor networks, and other experimental facilities?

These questions and many others are only now coming within our ability to answer because of advances in computing and related information technology. Once used by a handful of elite researchers in a few research communities on select problems, advanced computing has become essential to future progress across the frontier of science and engineering. Coupled with continuing improvements in microprocessor speeds, converging advances in networking, software, visualization, data systems and collaboration platforms are changing the way research and education are accomplished.

Today's scientists and engineers need access to new information technology capabilities, such as distributed wired and wireless observing network complexes, and sophisticated simulation tools that permit exploration of phenomena that can never be observed or replicated by experiment. Computation offers new models of behavior and modes of scientific discovery that greatly extend the limited range of models that can be produced with mathematics alone (for example, chaotic behavior). Fewer and fewer researchers working at the frontiers of knowledge can carry out their work without cyberinfrastructure of one form or another.

While hardware performance has been growing exponentially, with gate density doubling every 18 months, storage capacity every 12 months, and network capability every 9 months, it has become clear that increasingly capable hardware is not the only requirement for computation-enabled discovery. Sophisticated software, visualization tools, middleware and scientific applications created and used by interdisciplinary teams are critical to turning flops, bytes and bits into scientific breakthroughs. In addition to these technical needs, the exploration of new organizational models and the creation of enabling policies, processes, and economic frameworks are also essential. The combined power of these capabilities and approaches is necessary to advance the frontiers of science and engineering, make seemingly intractable problems solvable, and pose profound new scientific questions.
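To make these doubling times concrete, the short sketch below (an illustrative calculation added here, not part of the original report; the function name is our own) compounds each quoted rate over a five-year horizon and shows how unevenly the different capabilities grow.

```python
# Illustrative sketch: growth over five years, assuming the doubling times
# quoted above hold steadily throughout the period.

def growth_factor(doubling_time_months: float, horizon_months: float = 60.0) -> float:
    """Multiplicative growth after `horizon_months`, given a doubling time in months."""
    return 2.0 ** (horizon_months / doubling_time_months)

for name, doubling in [("gate density", 18), ("storage capacity", 12), ("network capability", 9)]:
    print(f"{name:18s}: ~{growth_factor(doubling):.0f}x over five years")

# Expected output (approximate):
#   gate density      : ~10x over five years
#   storage capacity  : ~32x over five years
#   network capability: ~102x over five years
```

The point of the arithmetic is simply that balanced cyberinfrastructure cannot be obtained by riding any single hardware curve; the components grow at very different rates.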

The comprehensive infrastructure needed to capitalize on dramatic advances in information technology has been termed cyberinfrastructure (CI). Cyberinfrastructure integrates hardware for computing, data and networks, digitally-enabled sensors, observatories and experimental facilities, and an interoperable suite of software and middleware services and tools. Investments in interdisciplinary teams and cyberinfrastructure professionals with expertise in algorithm development, system operations, and applications development are also essential to exploit the full power of cyberinfrastructure to create, disseminate, and preserve scientific data, information and knowledge.

For four decades, NSF has provided leadership in the scientific revolution made possible by information technology (Appendices B and C). Through investments ranging from supercomputing centers and the Internet to software and algorithm development, information technology has stimulated scientific breakthroughs across all science and engineering fields. Most recently, NSF's Information Technology Research (ITR) priority area sowed the seeds of broad and intensive collaboration among the computational, computer, and domain research communities that sets the stage for this Call to Action.

NSF is the only agency within the U.S. government that funds research and education across all disciplines of science and engineering. Over the past five years, NSF has held community workshops, commissioned blue-ribbon panels, and carried out extensive internal planning (Appendix B). Thus, it is strategically placed to leverage, coordinate and transition cyberinfrastructure advances in one field to all fields of research.

Other federal agencies, the administration, Congress, the private sector, and other nations are aware of the growing importance of cyberinfrastructure to progress in science and engineering. Other federal agencies have planned improved capabilities for specific disciplines, and in some cases to address interdisciplinary challenges. Other countries have also been making significant progress in scientific cyberinfrastructure. Thus, the U.S. must engage in and actively benefit from cyberinfrastructure developments around the world.

Not only is the time ripe for a coordinated investment in cyberinfrastructure, but progress at the science and engineering frontiers depends on it. Our communities are in place and are poised to respond to such an investment.

Working with the science and engineering research and education communities and partnering with other key stakeholders, NSF is ready to lead.

II. Vision, Mission and Principles for Cyberinfrastructure

A. Vision

NSF will play a leadership role in the development and support of a comprehensive cyberinfrastructure essential to 21st century advances in science and engineering research and education.

B. Mission

NSF's mission for cyberinfrastructure (CI) is to:

- Develop a human-centered CI that is driven by science and engineering research and education opportunities;
- Provide the science and engineering communities with access to world-class CI tools and services, including those focused on: high performance computing; data, data analysis and visualization; networked resources and virtual organizations; and learning and workforce development;
- Promote a CI that serves as an agent for broadening participation and strengthening the nation's workforce in all areas of science and engineering;
- Provide a sustainable CI that is secure, efficient, reliable, accessible, usable, and interoperable, and that evolves as an essential national infrastructure for conducting science and engineering research and education; and
- Create a stable but extensible CI environment that enables the research and education communities to contribute to the agency's statutory mission.

Visualization of a molecular dynamics simulation of a double-stranded DNA molecule as it enters a nanopore in a silicon nitride membrane.


C. Principles

The following principles will guide the agency's FY 2006 through FY 2010 investments:

- Science and engineering research and education are foundational drivers of CI.
- NSF has a unique leadership role in formulating and implementing a national CI agenda focused on advancing science and engineering.
- Inclusive strategic planning is required to effectively address CI needs across a broad spectrum of organizations, institutions, communities and individuals, with input to the process provided through public comments, workshops, funded studies, advisory committees, merit review and open competitions.
- Strategic investments in CI resources and services, coupled with an enabling policy and organizational framework, are essential to continued U.S. leadership in science and engineering.
- The integration and sharing of cyberinfrastructure assets deployed and supported at national, regional, local, community and campus levels represent the most effective way of constructing a comprehensive CI ecosystem suited to meeting future needs.
- Public and private national and international partnerships that integrate CI users and providers and benefit NSF's research and education communities are also essential for enabling next-generation science and engineering.
- Existing strengths, including research programs and CI facilities, serve as a foundation upon which to build a CI designed to meet the needs of the broad science and engineering community.
- Merit review is essential for ensuring that the best ideas are pursued in all areas of CI funding.
- Regular evaluation and assessment tailored to individual projects is essential for ensuring accountability to all stakeholders.
- A collaborative CI governance and coordination structure that includes representatives who contribute to basic CI research, development and deployment, as well as those who use CI, is essential to ensure that CI is responsive to community needs and empowers research at the frontier.

III. Goals and Strategies

NSF's vision and mission statements on CI need well-defined goals and strategies to turn them into reality. The goals underlying these statements are provided below, with each goal followed by a brief description of the strategy to achieve the goal.

Across the CI landscape, NSF will:

Provide communities addressing the most computationally challenging problems with access to a world-class, high performance computing (HPC) environment through NSF acquisition and through exchange-of-service agreements with other entities, where possible.

NSF's investment strategy for the provision of CI resources and services will be linked to careful requirements analyses of the computational needs of research and education communities. NSF investments will be coordinated with those of other agencies in order to maximize access to these capabilities and to provide a range of representative high performance architectures.

Broaden access to state-of-the-art computing resources, focusing especially on institutions with less capability and communities where computational science is an emerging activity.

Robert Patterson demonstrates NCSA's 3D visualization to Dr. Arden Bement, the Director of NSF, and others during the FY08 NSF budget roll-out.


Building on the achievements of current CI service providers and other NSF investments, the agency will work to make necessary computing resources more broadly available, paying particular attention to emerging and underserved communities.

Support the development and maintenance of robust systems software, programming tools, and applications needed to close the growing gap between peak performance and sustained performance on actual research codes, and to make the use of HPC systems, as well as novel architectures, easier and more accessible.

NSF will build on research in computer science and other research areas to provide science and engineering applications and problem-solving environments that more effectively exploit innovative architectures and large-scale computing systems. NSF will continue and build on its existing collaborations with other agencies in support of the development of HPC software and tools.
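As a concrete illustration of the peak-versus-sustained gap named in the goal above, the sketch below (ours, not the report's; it assumes NumPy is available and uses a placeholder peak rate) times a dense matrix multiply and reports the fraction of the assumed peak actually achieved.

```python
# Illustrative sketch: measure the sustained floating-point rate of a real kernel
# and compare it with a nominal peak. PEAK_GFLOPS is a placeholder; substitute
# the published peak for the processor you are measuring.
import time
import numpy as np

PEAK_GFLOPS = 100.0  # hypothetical peak rate for this node, in GFLOP/s

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b                      # dense matrix multiply: ~2 * n**3 floating-point operations
elapsed = time.perf_counter() - start

sustained = 2.0 * n**3 / elapsed / 1e9
print(f"sustained: {sustained:.1f} GFLOP/s "
      f"({100.0 * sustained / PEAK_GFLOPS:.0f}% of assumed peak)")
```

Highly tuned kernels such as this one sit near peak; typical research codes sustain a far smaller fraction, which is the gap the software investments above are meant to close.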

Support the continued development, expansion, hardening and maintenance of end-to-end software systems (user interfaces, workflow engines, science and engineering applications, data management, analysis and visualization tools, collaborative tools, and other software integrated into complete science and engineering systems via middleware) in order to bring the full power of a national cyberinfrastructure to communities of scientists and engineers.

NCSA's Cobalt computing system uses a 3D cylindrical configuration to model the sediment discharge of a river into the ocean and the initial stages of alluvial fan formation at the river's mouth.

Cyberinfrastructure will broaden access to state-of-the-art resources for learning and discovery, creating new opportunities for participation by emerging and underserved communities.


These investments will build on the software products of current and former programs, and will leverage work in core computer science research and development efforts supported by NSF and other federal agencies.

Support the development of the computing professionals, interdisciplinary teams, enabling policies and procedures, and new organizational structures such as virtual organizations, that are needed to achieve the scientific breakthroughs made possible by advanced CI, paying particular attention to opportunities to broaden the participation of underrepresented groups.

NSF will continue to improve its understanding of how participants in its research and education communities, as well as the scientific workforce, can use CI. For example, virtual organizations empower communities of users to interact, exchange information, and access and share resources through tailored interfaces. Some of NSF's investments will focus on appropriate mechanisms or structures for use, while others will focus on how best to train future users of CI. NSF will take advantage of the emerging communities associated with CI that provide unique and special opportunities for broadening participation in the science and engineering enterprise.

Support state-of-the-art innovation in data management and distribution systems, including digital libraries and educational environments that are expected to contribute to many of the scientific breakthroughs of the 21st century.

NSF will foster communication among forefront data management and distribution systems, digital libraries, and other education environments sponsored in its various directorates. NSF will ensure that its efforts take advantage of innovation in large data management and distribution activities sponsored by other agencies and through international efforts. These developments will play a critical role in decisions that NSF makes about stewardship of long-lived data.

Support the design and development of the CI needed to realize the full scientific potential of NSF's investments in tools and large facilities, from observatories and accelerators to sensor networks and remote observing systems.

The DANSE project at Caltech integrates new materials theory with high-performance computing, using data from facilities such as DOE's new Spallation Neutron Source in Oak Ridge, TN.

NSF's investments in large facilities and other tools require new types of CI such as wireless control of networks of sensors in hostile environments, rapid distribution and analysis of petascale data sets around the world, adaptive knowledge-based control and sampling systems, and innovative visualization systems for collaboration. NSF will ensure that these projects invest appropriately in CI capabilities, promoting the integrated and widespread use of the unique services provided by these and other facilities. In addition, NSF's CI programs will be designed to serve the needs of these projects.

Support the development and maintenance of the increasingly sophisticated applications needed to achieve the scientific goals of research and education communities.

The applications needed to produce cutting-edge science and engineering have become increasingly complex. They require teams, even communities, to develop and sustain wide and long-term applicability, and they leverage underlying software tools and increasingly common, persistent CI resources such as data repositories and authentication and authorization services. NSF's investments in applications will involve its directorates and offices that support domain-specific science and engineering. Special attention will be paid to the cross-disciplinary nature of much of the work.

Invest in the high-risk/high-gain basic research in computer science, computing and storage devices, mathematical algorithms, and the human/CI interfaces that are critical to powering the future exponential growth in all aspects of computing, including hardware speed, storage, connectivity and scientific productivity.

NSF's investments in operational CI must be coupled with vigorous research programs in the directorates to ensure that operational capabilities continue to expand and extend in the future. Important among these programs are activities to understand how humans adopt and use CI. NSF is especially well-placed to foster collaborations among computer scientists; social, behavioral and economic scientists; and other domain scientists and engineers to understand how humans can best use CI, in both research and education environments.

Provide a framework that will sustain reliable, stable resources and services while enabling the integration of new technologies and research developments with a minimum of disruption to users.

NSF will minimize disruption to users by realizing a comprehensive CI with an architecture and framework that emphasizes interoperability and open standards, thus providing flexibility for upgrades, enhancements and evolutionary changes. Pre-planned arrangements for alternative CI availability during competitions, changeovers and upgrades to production operations and services will be made, including cooperative arrangements with other agencies.

A strategy common to achieving all of these goals is partnering nationally and internationally with other agencies, the private sector, and universities to achieve a worldwide CI that is interoperable, flexible, efficient, evolving and broadly accessible. In particular, NSF will take a lead role in formulating and implementing a national CI strategy.

IV. Planning for Cyberinfrastructure

To implement its cyberinfrastructure vision, NSF will develop interdependent plans for each of the following aspects of CI, with emphasis on their integration to create a balanced science- and engineering-driven national CI:

- High Performance Computing;
- Data, Data Analysis, and Visualization;
- Virtual Organizations for Distributed Communities; and
- Learning and Workforce Development.

Others may be added at a later date.

While these aspects are addressed separately as a means for organizing this document, the central goal is the development of a fully-integrated CI framework comprised of the balanced, seamless blending of these components. This will require integrative management structures (such as the newly formed Office of Cyberinfrastructure, the NSF-wide Cyberinfrastructure Council, and the Cyberinfrastructure Coordinators Committee), as well as science-driven, community-based planning and implementation processes that span all the elements of a truly comprehensive CI framework.

Researchers upgrade the software of an automated weather station that transmits data to help track the iceberg's position in the Antarctic and reports on the microclimate of the ice surface.

These plans will be reviewed annually and will evolve over time, paced by the considerable rate of innovation in computing and communication, and by the growing needs of the science and engineering community for state-of-the-art CI capabilities. Through cycles of use-driven innovation, NSF's vision will become reality.


Chapter 2: High Performance Computing (2006-2010)

NCAR's blueice supercomputer, shown on the opposite page, enables scientists to enhance the resolution and complexity of Earth system models, improve climate and weather research, and provide more accurate data to decision makers.

I. What Does High Performance Computing Offer Science and Engineering?

What are the three-dimensional structures of all of the proteins encoded by the human genome, and how does structure influence their function in a human cell? What patterns of emergent behavior occur in models of very large societies? How do massive stars explode and produce the heaviest elements in the periodic table? What sort of abrupt transitions can occur in Earth's climate and ecosystem structure? How do these transitions occur, and under what circumstances? If we could design catalysts atom-by-atom, could we transform industrial synthesis? What strategies might be developed to optimize management of complex infrastructure systems? What kind of language processing can occur in large assemblages of neurons? Can we enable integrated planning and response to natural and man-made disasters that prevent or minimize the loss of life and property? These are just some of the important questions that researchers wish to answer using contemporary tools in a state-of-the-art High Performance Computing (HPC) environment.

Using HPC-based applications, researchers study the properties of minerals at the extreme temperatures and pressures that occur deep within the Earth. They simulate the development of structure in the early Universe. They probe the structure of novel phases of matter such as the quark-gluon plasma. HPC capabilities enable the modeling of life cycles that capture interdependencies across diverse disciplines and multiple scales to create globally competitive manufacturing enterprise systems. And they examine the way proteins fold and vibrate after they are synthesized inside an organism. In fact, sophisticated numerical simulations permit scientists and engineers to perform a wide range of in silico experiments that would otherwise be too difficult, too expensive, or impossible to perform in the laboratory.

The visualization above, created from data generated by a tornado simulation calculated on the NCSA computing cluster, shows the tornado by spheres colored according to pressure. Orange and blue tubes represent the rising and falling airflow around the tornado.

HPC systems and services are also essential to the success of research conducted with sophisticated experimental tools. Without the waveforms produced by the numerical simulation of black hole collisions and other astrophysical events, gravitational wave signals cannot be extracted from the data produced by the Laser Interferometer Gravitational Wave Observatory. High-resolution seismic inversions from the higher density of broad-band seismic observations furnished by the EarthScope project are necessary to determine shallow and deep Earth structure. Simultaneous integrated computational and experimental testing is conducted on the Network for Earthquake Engineering Simulation to improve seismic design of buildings and bridges. HPC is essential to extracting the signature of the Higgs boson and supersymmetric particles, two of the scientific drivers of the Large Hadron Collider, from the petabytes of data produced in the trillions of particle collisions.

Science and engineering research and education enabled by state-of-the-art HPC tools have a direct bearing on the nation's competitiveness. If investments in HPC are to have a long-term impact on problems of national need, such as bioengineering, critical infrastructure protection (for example, the electric power grid), health care, manufacturing, nanotechnology, energy, and transportation, then HPC tools must deliver high performance capability for a wide range of science and engineering applications.

A functioning ribosome, a complex of three large RNA molecules and fifty proteins with three million atoms, is simulated on the Texas Advanced Computing Center computer.


Results from the Parallel Climate Model, prepared from data in the Earth System Grid, depict wind vectors, surface pressure, sea surface temperature and sea ice concentration.

II. The Next Five Years: Creating a High Performance Computing Environment for Petascale Science and Engineering

NSF's five-year HPC goal is to enable petascale science and engineering through the deployment and support of a world-class HPC environment comprising the most capable combination of HPC assets available to the academic community. The petascale HPC environment will enable investigations of computationally challenging problems that require computers operating at sustained speeds on actual research codes of 10^15 floating point operations per second (petaflops), or that work with extremely large data sets on the order of 10^15 bytes (petabytes).
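To give a feel for these magnitudes, the short sketch below (our illustrative arithmetic, not the report's; the problem sizes and bandwidth are invented for illustration) estimates the wall-clock time for a hypothetical simulation on a machine sustaining one petaflop per second, and the time needed simply to read a petabyte at an assumed aggregate I/O rate.

```python
# Back-of-the-envelope arithmetic for petascale quantities.
# All problem sizes and bandwidths below are assumptions chosen for illustration.

SUSTAINED_FLOPS = 1e15          # one petaflop/s sustained on the application
IO_BANDWIDTH_BYTES = 100e9      # assumed aggregate I/O bandwidth: 100 GB/s

# A hypothetical simulation: 10 billion grid cells, 10,000 time steps,
# and 1,000 floating-point operations per cell per step.
total_flops = 1e10 * 1e4 * 1e3                                   # = 1e17 operations
print(f"compute time: {total_flops / SUSTAINED_FLOPS:.0f} s")    # ~100 s

# Reading a one-petabyte data set at the assumed bandwidth.
petabyte = 1e15
print(f"time to read 1 PB: {petabyte / IO_BANDWIDTH_BYTES / 3600:.1f} h")  # ~2.8 h
```

The asymmetry in the two numbers is one reason the document insists that petascale systems be "appropriately balanced," with storage and data-movement capacity sized alongside raw compute.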

Petascale HPC capabilities will permit researchers to perform simulations that are intrinsically multi-scale or that involve multiple simultaneous reactions, such as modeling the interplay among genes, microbes, and microbial communities and simulating the interactions among the ocean, atmosphere, cryosphere and biosphere in Earth system models. In addition to addressing the most computationally challenging demands of science and engineering, new and improved HPC software services will make supercomputing platforms supported by NSF and other partner organizations more efficient, more accessible, and easier to use.

NSF will support the deployment of a well-engineered, scalable HPC infrastructure designed to evolve as science and engineering research needs change. It will include a sufficient level of diversity, both in architecture and scale of deployed HPC systems, to realize the research and education goals of the broad science and engineering community. NSF's HPC investments will be complemented by its simultaneous investments in data analysis and visualization facilities essential to the effective transformation of data products into information and knowledge.

The following principles will guide the agency's FY 2006 through FY 2010 investments:

- Science and engineering research and education priorities will drive HPC investments.
- Collaborative activities involving science and engineering researchers and private sector organizations are needed to ensure that HPC systems and services are optimally configured to support petascale scientific computing.
- Researchers and educators require access to reliable, robust, production-quality HPC resources and services.
- HPC-related research and development advances generated in the public and private sectors, both domestic and foreign, must be leveraged to enrich HPC capabilities.
- The development, implementation and annual update of an effective multi-year HPC strategy is crucial to the timely introduction of research and development outcomes and innovations in HPC systems, software and services.

NSF's implementation plan to create a petascale environment includes the following three interrelated components:

1) Specification, Acquisition, Deployment and Operation of Science-Driven HPC Systems Architectures

An effective computing environment designed to meet the computational needs of a range of science and engineering applications will include a variety of computing systems with complementary performance capabilities. By 2010, the petascale computing environment available to the academic science and engineering community is likely to consist of: (i) a significant number of systems with peak performance in the 50-500 teraflops range, deployed and supported at the local level by individual campuses and other research organizations; (ii) multiple systems with peak performance of 500+ teraflops that support the work of thousands of researchers nationally; and (iii) at least one system capable of delivering sustained performance approaching 10^15 floating point operations per second on real applications that consume large amounts of memory and/or that work with very large data sets, for projects that demand the highest levels of computing performance. All NSF-deployed systems will be appropriately balanced and will include core computational hardware, local storage of sufficient capacity, and appropriate data analysis and visualization capabilities.

This numerical simulation, created on the NCSA Itanium Linux Cluster by international researchers, shows the merger of two black holes and the ripples in space-time that are born of the merger.

Over the FY 2006-2010 period, NSF will focus on HPC system acquisitions in the 100 teraflops to 10 petaflops range, where strategic investments on a national scale are necessary to ensure international leadership in science and engineering. Since different science and engineering codes may achieve optimal performance on different HPC architectures, it is likely that by 2010 the NSF-supported HPC environment will include both loosely coupled and tightly coupled systems, with several different memory models.

To address the challenge of providing the research community with access to a range of HPC architectures within a constrained budget, a key element of NSF's strategy is to participate in resource-sharing with other federal agencies. A strengthened interagency partnership will focus, to the extent practicable, on ensuring shared access to federal leadership-class resources with different architectures, and on the coordination of investments in HPC system acquisition and operation.

Massachusetts Institute of Technology researchers are developing computational tools to analyze the structure of any protein, such as the human ubiquitin hydrolase (shown), for knots.

The Department of Energy's Office of Science and National Nuclear Security Administration have very active programs in leadership computing. The Department of Defense's (DOD) High Performance Computing Modernization Office (HPCMOD) provides HPC resources and services for the DOD science and engineering community, while NASA is deploying significant computing systems that are also of interest to NSF PIs. NSF will explore enhanced coordination mechanisms with other appropriate federal agencies to capitalize on their common interests. It will seek opportunities to make coordinated and collaborative investments in science-driven hardware architectures in order to increase the diversity of architectures of leadership-class systems available to researchers and educators around the country, to promote sharing of lessons learned, and to provide a richer HPC environment for the user communities supported by each agency.

Strong partnerships involving universities, industry and government are also critical to success. NSF will also promote resource sharing between and among academic institutions to optimize the accessibility and use of HPC assets deployed and supported at the campus level.

In addition to leveraging the promise of Phase III of the Defense Advanced Research Projects Agency (DARPA)-sponsored High Productivity Computing Systems (HPCS) program, the agency will establish a discussion and collaboration forum for scientists and engineers, including computational and computer scientists and engineers, and HPC system vendors, in order to ensure that HPC systems are optimally configured to support state-of-the-art scientific computing. On the one hand, these discussions will keep NSF and the academic community informed about new products, product roadmaps and technology challenges at various vendor organizations. On the other, they will provide HPC system vendors with insights into the major concerns and needs of the academic science and engineering community. These activities will lead to better alignment between applications and hardware, both by influencing algorithm design and by influencing system integration.

2) Development and Maintenance of Supporting Software: New Design Tools, Performance Modeling Tools, Systems Software, and Fundamental Algorithms

Many of the HPC software and service building blocks in scientific computing are common to a number of science and engineering applications. A supporting software and service infrastructure will accelerate the development of the scientific application codes needed to solve challenging scientific problems, and will help insulate these codes from the evolution of future generations of HPC hardware.

Supporting software services include the provision of intelligent development and problem-solving environments and tools. These tools are designed to provide improvements in ease of use, reusability of modules, and portable performance. Tools and services that take advantage of commonly-supported software tools can deliver similar work environments across different HPC platforms, greatly reducing the time-to-solution of computationally-intensive research problems by permitting local development of research codes that can then be rapidly transferred to, or incorporate services provided by, larger production environments. These tools, and workflows built from collections of such tools, can also be packaged for more general use. Applications scientists and engineers will also benefit from the development of new tools and approaches to debugging, performance analysis, and performance optimization.

Specific applications depend on a broad class of numerical and non-numerical algorithms that are widely used by many applications, including linear algebra, fast spectral transforms, optimization algorithms, multi-grid methods, adaptive mesh refinement, symplectic integrators, and sorting and indexing routines. To date, improved or new algorithms have been important contributors to performance improvements in science and engineering applications, the development of multi-grid solvers for elliptic partial differential equations being a prime example. Innovations in algorithms will have a significant impact on the performance of applications software. The development of algorithms for different architectural environments is an essential component of the effort to develop portable, scalable applications software. Other important software services include libraries for communications services, such as MPI and OpenMP.
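As a small illustration of the kind of communication-library building block mentioned above, the sketch below (ours, not the report's) assembles a global dot product from locally held pieces of two distributed vectors using an MPI collective; it assumes the mpi4py package and an MPI runtime are installed.

```python
# Minimal sketch of a distributed dot product using an MPI collective.
# Assumes mpi4py and an MPI implementation are available; run with, e.g.:
#   mpiexec -n 4 python dot_product.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_global = 1_000_000
n_local = n_global // size          # assume size divides n_global, for simplicity

# Each rank holds only its own slice of the two vectors.
rng = np.random.default_rng(seed=rank)
x_local = rng.random(n_local)
y_local = rng.random(n_local)

# Local partial result, then a global sum-reduction across all ranks.
local_dot = float(np.dot(x_local, y_local))
global_dot = comm.allreduce(local_dot, op=MPI.SUM)

if rank == 0:
    print(f"global dot product over {size} ranks: {global_dot:.4f}")
```

Real applications layer far more elaborate patterns (halo exchanges, parallel solvers, collective I/O) on the same primitives, which is why the document treats these libraries as shared infrastructure rather than per-project code.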

The development and deployment of operating systems and compilers that scale to hundreds of thousands of processors are also necessary. They must provide effective fault-tolerance and effectively insulate users from parallelization, as well as provide protection from latency management and thread management issues. To test new developments at large scales, operating systems and kernel researchers and developers must have access to the infrastructure necessary to test their developments at scale.

The software provider community will be a source for: applied research and development of supporting technologies; harvesting promising supporting software technologies from the research communities; performing scalability/reliability tests to explore software viability; developing, hardening and maintaining software where necessary; and facilitating the transition of commercially viable software into the private sector. It is anticipated that this community will also support general software engineering consulting services for science and engineering applications, and will provide software engineering consulting support to individual researchers and research and education teams as necessary.

The software provider community will be expected to promote software interoperability among the various components of the cyberinfrastructure software stack, such as those generated to provide modeling and simulation data, data analysis and visualization services, and networked resources and virtual organization capabilities. (See Chapters 3 and 4 in this document.) This will be accomplished through the creation and utilization of appropriate software test harnesses, and will ensure that sufficient configuration controls are in place to support the range of HPC platforms used by the research and education community. The applications community will identify needed improvements in supporting software and will provide input and feedback on the quality of services provided.

NSF will seek guidance on the evolution of software support from representatives of academia, federal agencies and private sector organizations, including third-party and system vendors. They will provide input on the strengths, weaknesses, opportunities and gaps in the software services currently available to the science and engineering research and education communities.

To minimize duplication of effort and optimize the value of HPC services provided to the science and engineering community, NSF's investments will be coordinated with those of other agencies. DOE currently invests in software infrastructure centers through the Scientific Discovery through Advanced Computing (SciDAC) program, while DARPA's investments in the HPCS program contribute significant systems software and hardware innovations. NSF will seek to leverage and add value to ongoing DOE and DARPA efforts in this area.


Two skulls of separate species of pterosaurs were scanned at the High-Resolution X-ray Computed Tomography Facility at The University of Texas at Austin, and the data were then fed to the DigiMorph digital library to produce 2-D and 3-D structural visualizations.

3) Development and Maintenance of Portable, Scalable Applications Software

Today's microprocessor-based terascale computers place considerable demands on our ability to manage parallelism and to deliver large fractions of peak performance. As the agency seeks to create a petascale computing environment, it will embrace the challenge of developing or converting key application codes to run effectively on new and evolving system architectures.

Over the FY 2006 through FY 2010 period, NSF will make significant new investments in the development, hardening, enhancement and maintenance of scalable applications software, including community models, to exploit the full potential of current terascale and future petascale system architectures. The creation of well-engineered, easy-to-use software will reduce the complexity and time-to-solution of today's challenging scientific applications. NSF will promote the incorporation of sound software engineering approaches in existing widely used research codes and in the development of new research codes. Multidisciplinary teams of researchers will work together to create, modify and optimize applications for current and future systems using performance modeling tools and simulators.
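Performance models need not be elaborate to guide such work. A simple, hypothetical example is an Amdahl's-law estimate of achievable speedup, sketched below; the parallel fraction and processor counts are illustrative only.

    # Minimal sketch of a performance model: an Amdahl's-law speedup estimate.
    # The parallel fraction and processor counts are illustrative only.
    def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
        """Ideal speedup when a fixed fraction of the work parallelizes perfectly."""
        serial_fraction = 1.0 - parallel_fraction
        return 1.0 / (serial_fraction + parallel_fraction / processors)

    for p in (16, 256, 4096, 65536):
        print(f"{p:6d} processors -> speedup {amdahl_speedup(0.99, p):7.1f}x")

Even this crude model makes clear why the serial fraction of a code comes to dominate at very large processor counts, which is one reason performance modeling belongs in the software engineering process from the start.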

Since the nature and genesis of science and engineering codes vary across the research landscape, a successful programmatic effort in this area will weave together several strands. A new activity will be designed to take applications that have the potential to be widely used within one or more communities, to harden these applications using modern software engineering practices, to develop versions for the range of architectures on which scientists wish to run them, to optimize them for modern HPC architectures, and to provide user support.


An artist's conception (above) depicts fundamental NEON observatory instrumentation and systems, as well as the potential spatial organization of the environmental measurements made by these instruments and systems.

The image on the opposite page shows the action of the enzyme cellulase on cellulose, using the CHARMM community code in a simulation carried out at SDSC. NREL will use the simulation to help develop strategies for efficient large-scale conversion of biomass into ethanol.

Chapter 3: Data, Data Analysis, and Visualization (2006-2010)

I. A Wealth of Scientific Opportunities Afforded by Digital Data

Science and engineering research and education have become increasingly data-intensive as a result of the proliferation of digital technologies, instrumentation, and pervasive networks through which data are collected, generated, shared and analyzed. Worldwide, scientists and engineers are producing, accessing, analyzing, integrating and storing terabytes of digital data daily through experimentation, observation and simulation. Moreover, the dynamic integration of data generated through observation and simulation is enabling the development of new scientific methods that adapt intelligently to evolving conditions to reveal new understanding. The enormous growth in the availability and utility of scientific data is increasing scholarly research productivity, accelerating the transformation of research outcomes into products and services, and enhancing the effectiveness of learning across the spectrum of human endeavor.

New scientific opportunities are emerging from increasingly effective data organization, access and usage. Together with the growing availability and capability of tools to mine, analyze and visualize data, the emerging data cyberinfrastructure is revealing new knowledge and fundamental insights. For example, analyses of DNA sequence data are providing remarkable insights into the origins of man, revolutionizing our understanding of the major kingdoms of life, and revealing stunning and previously unknown complexity in microbial communities. Sky surveys are changing our understanding of the earliest conditions of the universe and providing comprehensive views of phenomena ranging from black holes to supernovae.


The National Virtual Observatory's Sky Statistics Survey allows astronomers to get a fast inventory of astronomical objects from various catalogs.


Researchers are monitoring socioeconomic dynamics over space and time to advance our understanding of individual and group behavior and its relationship to social, economic and political structures. Using combinatorial methods, scientists and engineers are generating libraries of new materials and compounds for health and engineering, and environmental scientists and engineers are acquiring and analyzing streaming data from massive sensor networks to understand the dynamics of complex ecosystems.

In this dynamic research and education environment, science and engineering data are constantly being collected, created, deposited, accessed, analyzed and expanded in the pursuit of new knowledge. In the future, U.S. international leadership in science and engineering will increasingly depend upon our ability to leverage this reservoir of scientific data captured in digital form, and to transform these data into information and knowledge, aided by sophisticated data mining, integration, analysis and visualization tools.

This chapter sets forth a framework in which NSF will work with its partners in science and engineering, public and private sector organizations both foreign and domestic, representing data producers, scientists, engineers, managers and users alike, to address data acquisition, access, usage, stewardship and management challenges in a comprehensive way.

    II. Definitions

    A. Data, Metadata and Ontologies

In this document, data and digital data are used interchangeably to refer to data and information stored in digital form and accessed electronically.

Data. For the purposes of this document, data are any and all complex data entities from observations, experiments, simulations, models, and higher-order assemblies, along with the associated documentation needed to describe and interpret the data.

Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, interrelationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.

Ontology. An ontology is the systematic description of a given phenomenon. It often includes a controlled vocabulary and relationships, captures nuances in meaning, and enables knowledge sharing and reuse. (An illustrative sketch of a metadata record and a small controlled vocabulary follows these definitions.)
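For illustration only, the sketch below represents a minimal metadata record and a tiny controlled vocabulary in Python; the field names and terms are invented for the example and do not correspond to any particular community standard.

    # Hypothetical sketch: a minimal metadata record and controlled vocabulary.
    # Field names and vocabulary terms are invented for illustration.
    metadata_record = {
        "title": "Stream temperature time series, site A",
        "creator": "Example Research Team",
        "created": "2006-08-14",
        "format": "CSV",
        "provenance": "Logged by in-situ sensor; calibrated against a lab standard",
        "subject": "water_temperature",     # term drawn from the vocabulary below
    }

    # A small controlled vocabulary with simple relationships (an ontology fragment).
    vocabulary = {
        "environmental_measurement": {"broader": None},
        "water_temperature": {"broader": "environmental_measurement"},
        "air_temperature": {"broader": "environmental_measurement"},
    }

    def is_a(term, ancestor):
        """Return True if `term` falls under `ancestor` in the vocabulary."""
        while term is not None:
            if term == ancestor:
                return True
            term = vocabulary[term]["broader"]
        return False

    print(is_a(metadata_record["subject"], "environmental_measurement"))  # True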

    B. Data Collections

This document adopts the definition of data collection types provided in the NSB report on Long-Lived Digital Data Collections, where data collections are characterized as being one of three functional types:

Research Collections. Authors are individual investigators and investigator teams. Research collections are usually maintained to serve immediate group participants only for the life of a project, and are typically subjected to limited processing or curation. Data may not conform to any data standards.

Resource Collections. Resource collections are authored by a community of investigators, often within a domain of science or engineering, and are often developed with community-level standards. Budgets are often intermediate in size. Lifetime is between the mid- and long-term.


The GLORIAD network, an optical network ring around the northern hemisphere, promotes new opportunities for cooperation and understanding for scientists, educators and students.

Reference Collections. Reference collections are authored by and serve large segments of the science and engineering community, and conform to robust, well-established and comprehensive standards, which often lead to a universal standard. Budgets are large and are often derived from diverse sources with a view to indefinite support.

Boundaries between the types are not rigid, and collections originally established as research collections may evolve over time to become resource and/or reference collections. In this document, the term data collection is construed to include one or more databases and their relevant technological implementation. Data collections are managed by organizations and individuals with the necessary expertise to structure them and to support their effective use.

III. Developing a Coherent Data Cyberinfrastructure in a Complex Global Context

Since data and data collections are owned or managed by a wide range of communities, organizations and individuals around the world, NSF must work in an evolving environment constantly being shaped by developing international and national policies and treaties, community-specific policies and approaches, institutional-level programs and initiatives, individual practices, and continually advancing technological capabilities.

At the international level, a number of nations and international organizations have already recognized the broad societal, economic, and scientific benefits that result from open access to science and engineering digital data. In 2004, more than 30 nations, including the United States, declared their joint commitment to work toward the establishment of common access regimes for digital research data generated through public funding. Since the international exchange of scientific data, information and knowledge promises to significantly increase the scope and scale of research and its corresponding impact, these nations are working together to define the implementation steps necessary to enable the global science and engineering system.

The U.S. community is engaged through the Committee on Data for Science and Technology (CODATA). The U.S. National Committee for CODATA (USNC/CODATA) is working with international CODATA partners, including the International Council for Science (ICSU), the International Council for Scientific and Technical Information (ICSTI), the World Data Centers (WDCs) and others, to accelerate the development of a global open-access scientific data and information resource, through the construction of an online open-access knowledge environment as well as through targeted projects. The Global Information Commons for Science is a multi-stakeholder initiative, arising out of the second phase of the World Summit on the Information Society, that can provide important opportunities for international coordination and cooperation. The goals of this initiative include improving understanding of the benefits of access to scientific data and information, promoting successful institutional and legal models for providing sustainable access, and enhancing coordination among the many science and engineering stakeholders around the world.

A number of international science and engineering communities have also been developing data management and curation approaches for reference and resource collections. For example, the international Consultative Committee for Space Data Systems (CCSDS) defined an archive reference model and service categories for the intermediate and long-term storage of digital data relevant to space missions.


Images produced by Montage on the SDSC TeraGrid from the 2MASS all-sky survey provide astronomers with new insights into the large-scale structure of the Milky Way.

This effort produced the Open Archival Information System (OAIS), now adopted as the de facto standard for building digital archives, and provided evidence that a community-focused activity can have much broader impact than originally intended. In another example, the Inter-university Consortium for Political and Social Research (ICPSR), a membership-based organization with over 500 member colleges and universities around the world, maintains and provides access to a vast archive of social science data. ICPSR serves as a content management organization, preserving relevant social science data and migrating them to new storage media as technology changes, and also provides user support services. ICPSR recently announced plans to establish an international standard for social science documentation. Similar activities in other communities are also underway. Clearly, NSF must maintain a presence in, support, and add value to these ongoing international discussions and activities.

Activities on an international scale are complemented by activities within nation states. In the United States, a number of organizations and communities of practice are exploring mechanisms to establish common approaches to digital data access, management and curation. For example, the Research Libraries Group (RLG, a not-for-profit membership organization representing libraries, archives and museums) and the U.S. National Archives and Records Administration (NARA, a sister agency whose mission is to provide direction and assistance to federal agencies on records management) are producing certification requirements for establishing and selecting reliable digital information repositories. RLG and NARA intend their results to be standardized via the International Organization for Standardization (ISO) Archiving Series, and these may impact all data collection types. The National Institutes of Health (NIH) National Center for Biotechnology Information plays an important role in the management of genome data at the national level, supporting public databases, developing software tools for analyzing data, and disseminating biomedical information.

At the institutional level, colleges and universities are developing approaches to digital data archiving, curation and analysis. They are sharing best practices to develop digital libraries that collect, preserve, index and share research and education material produced by faculty and other individuals within their organizations. The technological implementations of these systems are often open-source and support interoperability among their adopters. University-based research libraries and research librarians are positioned to make significant contributions in this area, where standard mechanisms for access and maintenance of scientific digital data may be derived from existing library standards developed for print material. These efforts are particularly important to NSF as the agency considers the implications not only of making all data generated with NSF funding broadly accessible, but also of promoting the responsible organization and management of these data so that they are widely usable.

IV. The Next Five Years: Towards a National Digital Data Framework

Motivated by a vision in which science and engineering digital data are routinely deposited in well-documented form, are regularly and easily consulted and analyzed by specialists and non-specialists alike, are openly accessible while suitably protected, and are reliably preserved, NSF's five-year goal is twofold:

To catalyze the development of a system of science and engineering data collections that is open, extensible and evolvable; and

To support the development of a new generation of tools and services facilitating data mining, integration, analysis, and visualization essential to turning data into new knowledge and understanding.


The IRIS Seismic Monitor System allows scientists and others to monitor global earthquakes in near real-time, visit seismic stations worldwide, and search the web for earthquake information.

The resulting national digital data framework will be an integral component in the national cyberinfrastructure framework described in this document. It will consist of a range of data collections and managing organizations, networked together in a flexible technical architecture using standard, open protocols and interfaces, and designed to contribute to the emerging global information commons. It will be simultaneously local, regional, national and global in nature, and will evolve as science and engineering research and education needs change and as new science and engineering opportunities arise. Widely accessible tools and services will permit scientists and engineers to access and manipulate these data to advance the science and engineering frontier.

In print form, the preservation process is handled through a system of libraries and other repositories throughout the country and around the globe. Two features of this print-based system make it robust. First, the diversity of business models, deriving support from a variety of sources, means that no single entity bears sole responsibility for preservation, and the system is resilient to changes in any particular sector. Second, there is overlap in the collections, and redundancy of content reduces the potential for catastrophic loss of information.

The national data framework is envisioned to provide an equally robust and diverse system for digital data management and access. It will promote interoperability between data collections supported and managed by a range of organizations and organization types; provide for appropriate protection and reliable long-term preservation of digital data; deliver computational performance, data reliability and movement through shared tools, technologies and services; and accommodate individual community preferences. NSF will also develop a suite of coherent data policies that emphasize open access and effective organization and management of digital data, while respecting the data needs and requirements within science and engineering domains.

The following principles will guide the agency's FY 2006 through FY 2010 investments:

Science and engineering research and education opportunities and priorities will motivate NSF investments in data cyberinfrastructure.

Science and engineering data generated with NSF funding will be readily accessible and easily usable, and will be appropriately, responsibly and reliably preserved.

Broad community engagement is essential to the prioritization and evaluation of the utility of scientific data collections, including the possible evolution from research to resource and reference collection types.

Continual exploitation of data in the creation of new knowledge requires that investigators have access to the tools and services necessary to locate and access relevant data, and understand its structure sufficiently to be able to interpret and (re)analyze what they find.

The establishment of strong, reciprocal, international, interagency and public-private partnerships is essential to ensure all stakeholders are engaged in the stewardship of valuable data assets. Transition plans, addressing issues such as media, stewardship and standards, will be developed for valuable data assets, to protect data and assure minimal disruption to the community during transition periods.

Mechanisms will be created to share data stewardship best practices between nations, communities, organizations and individuals.

In light of legal, ethical and national security concerns associated with certain types of data, mechanisms essential to the development of both statistical and technical ways to protect privacy and confidentiality will be supported.


Researchers check functionality and performance of the Compact Muon Solenoid detector at CERN before its closure. Built on the Large Hadron Collider, it provides a magnetic field of 4 T.

A. A Coherent Organizational Framework - Data Collections and Managing Organizations

To date, challenges associated with effective stewardship and preservation of scientific data have been more tractable when addressed through communities of practice that may derive support from a range of sources. For example, NSF supports the Incorporated Research Institutions for Seismology (IRIS) consortium to manage seismology data. Jointly with NIH and DOE, the agency supports the Protein Data Bank to manage data on the three-dimensional structures of proteins and nucleic acids. Multiple agencies support the University Corporation for Atmospheric Research, an organization that has provided access to atmospheric and oceanographic data sets, simulations, and outcomes extending back to the 1930s through the National Center for Atmospheric Research.

Existing collections and managing organization models reflect differences in culture and practice within the science and engineering community. As community proxies, data collections and their managing organizations can provide a focus for the development and dissemination of appropriate standards for data and metadata content and format, guided by an appropriate community-defined governance approach. This is not a static process, as new disciplinary fields and approaches, data types, organizational models and information strategies inexorably emerge. This is discussed in detail in the Long-Lived Digital Data Collections report of the National Science Board.

Since data are held by many federal agencies, commercial and non-profit organizations, and international entities, NSF will foster the establishment of interagency, public-private and international consortia charged with providing stewardship for digital data collections, to promote interoperability across data collections. The agency will work with the broad community of science and engineering data producers, managers, scientists and users to develop a common conceptual framework. A full range of mechanisms will be used to identify and build upon common ground across domain communities and managing organizations, engaging all stakeholders. Activities will include: the support of new projects; development and implementation of evaluation and assessment criteria that, among other things, reveal lessons learned across communities; support of community and intercommunity workshops; and the development of strong partnerships with other stakeholder organizations.


A simulated event of the collision of two protons in the ATLAS experiment. The colors of the tracks emanating from the center show the different types of particles emerging from the collision.

Stakeholders in these activities include data authors, data managers, data scientists and engineers, and data users representing a diverse range of communities and organizations, including universities and research libraries, government agencies, content management organizations and data centers, and industry.

To identify and promote lessons learned across managing organizations, NSF will continue to promote the coalescence of appropriate collections with overlapping interests, approaches and services. This reduces data-driven fragmentation of science and engineering domains. Progress is already being made in some areas. For example, NSF has been working with the environmental science and engineering community to promote collaboration across disciplines ranging from ecology and hydrology to environmental engineering. This has resulted in the emergence of common cyberinfrastructure elements and new interdisciplinary science and engineering opportunities.

B. Developing a Flexible Technological Architecture

From a technological perspective, the national data framework must provide for reliable preservation, access, analysis, interoperability, and data movement, possibly using a web or grid services distributed environment. The architecture must use standard open protocols and interfaces to enable the broadest use by multiple communities. It must facilitate user access, analysis and visualization of data, addressing issues such as authentication, authorization and other security concerns, as well as data acquisition, mining, integration, analysis and visualization. It must also support complex workflows enabling data discovery. Such an architecture can be visualized as a number of layers providing different capabilities to the user, including data management, analysis, collaboration tools, and community portals. The connections among these layers must be transparent to the end user, and services must be available as modular units responsive to individual or community needs. The system is likely to be implemented as a series of distributed applications and operations supported by a number of organizations and institutions distributed throughout the country. It must provide for the replication of data resources to reduce the potential for catastrophic loss of digital information through repeated cycles of systems migration and all other causes, since, unlike printed records, the media on which digital data are stored and the structures of the data are relatively fragile.
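As a purely hypothetical illustration of such a layered, modular design, the sketch below composes thin authentication, data-access and analysis layers behind a simple portal-style interface in Python; none of the class names, data set names or policies are drawn from this document.

    # Hypothetical sketch of a layered, modular data service architecture.
    # Each layer exposes a small interface and can be swapped independently.
    class AuthLayer:
        def __init__(self, authorized_users):
            self.authorized_users = set(authorized_users)

        def check(self, user):
            return user in self.authorized_users

    class DataLayer:
        """Stands in for a distributed, replicated collection of data sets."""
        def __init__(self):
            self._store = {"streamflow_2006": [3.2, 3.5, 2.9, 4.1]}

        def fetch(self, dataset_id):
            return list(self._store[dataset_id])   # return a copy, not the original

    class AnalysisLayer:
        @staticmethod
        def mean(values):
            return sum(values) / len(values)

    class Portal:
        """Community portal composing the layers behind one entry point."""
        def __init__(self, auth, data, analysis):
            self.auth, self.data, self.analysis = auth, data, analysis

        def summarize(self, user, dataset_id):
            if not self.auth.check(user):
                raise PermissionError(f"{user} is not authorized")
            return self.analysis.mean(self.data.fetch(dataset_id))

    portal = Portal(AuthLayer(["researcher@example.edu"]), DataLayer(), AnalysisLayer())
    print(portal.summarize("researcher@example.edu", "streamflow_2006"))

Because each layer hides its implementation behind a small interface, any one of them could, in principle, be replaced by a remote or replicated service without changing the portal code, which is the essence of the modularity described above.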

High-quality metadata, which summarize data content, context, structure, interrelationships, and provenance (information on history and origins), are critical to successful information management, annotation, integration and analysis. Metadata take on an increasingly important role when addressing issues associated with the combination of data from experiments, observations and simulations. In these cases, product data sets require metadata that describe, for example, relevant collection techniques, simulation codes or pointers to archived copies of simulation codes, and codes used to process, aggregate or transform data. These metadata are essential to create new knowledge and to meet the reproducibility imperative of modern science. Metadata are often associated with data via markup languages, representing a consensus around a controlled vocabulary to describe phenomena of interest to the community, and allowing detailed annotations of data to be embedded within a data set. Because there is often little awareness of markup language development activities within science and engineering communities, effort is expended reinventing what could be adopted or adapted from elsewhere. Scientists and engineers therefore need access to tools and services that help ensure that metadata are automatically captured or created in real time.
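For illustration only, the sketch below shows one way metadata might be captured automatically at the moment a derived data product is written, as a JSON "sidecar" record; the field names and file layout are invented for the example and do not represent any community markup standard.

    # Hypothetical sketch: capture provenance metadata automatically when a
    # derived data product is written. Field names are illustrative only.
    import hashlib
    import json
    import sys
    from datetime import datetime, timezone

    def write_with_provenance(path, rows, source_files, processing_step):
        """Write a derived data file and a JSON metadata sidecar next to it."""
        payload = "\n".join(",".join(map(str, row)) for row in rows)
        with open(path, "w") as f:
            f.write(payload)

        metadata = {
            "created": datetime.now(timezone.utc).isoformat(),
            "creator_script": sys.argv[0],
            "processing_step": processing_step,
            "source_files": source_files,
            "sha256": hashlib.sha256(payload.encode()).hexdigest(),
            "n_records": len(rows),
        }
        with open(path + ".meta.json", "w") as f:
            json.dump(metadata, f, indent=2)

    write_with_provenance(
        "daily_means.csv",
        rows=[("2006-08-14", 17.2), ("2006-08-15", 16.8)],
        source_files=["sensor_raw_2006.csv"],
        processing_step="daily averaging of raw sensor readings",
    )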

  • 7/28/2019 CI Vision March07

    34/64

    CyberinfrastruCture Visionfor 21st CenturyDisCoVery

    - 29 -

    National Science Foundation March 2007

Effective data analysis tools apply computational techniques to extract new knowledge through a better understanding of the data and its redundancies and relationships, by filtering extraneous information and by revealing previously unseen patterns. For example, the Large Hadron Collider at CERN generates such massive data sets that the detection of both expected events, such as the Higgs boson, and unexpected phenomena requires the development of new algorithms, both to manage the data and to analyze them. Algorithms and their implementations must be developed for statistical sampling, for visualization, to enable the storage, movement and preservation of enormous quantities of data, and to address other unforeseen problems certain to arise.
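One classical example of statistical sampling over data too large to hold in memory is reservoir sampling, sketched below purely as an illustration; the stream contents and sample size are arbitrary.

    # Minimal sketch of reservoir sampling: keep a uniform random sample of
    # fixed size k from a data stream of unknown (possibly enormous) length.
    import random

    def reservoir_sample(stream, k):
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)          # fill the reservoir first
            else:
                j = random.randint(0, i)        # replace with decreasing probability
                if j < k:
                    reservoir[j] = item
        return reservoir

    # Example: sample 5 values from a simulated stream of one million readings.
    events = (x * 0.001 for x in range(1_000_000))
    print(reservoir_sample(events, 5))

Because the stream is consumed once and never stored, techniques of this kind let analysis proceed even when the full data set cannot be moved or preserved in its entirety.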

Scientific visualization, including not just static images but also animation and interaction, leads to better analysis and enhanced understanding. Currently, many visualization systems are domain- or application-specific and require a certain commitment to understanding or learning to use them. Making visualization services more transparent to the user lowers the threshold of usability and accessibility, and makes it possible for a wider range of users to explore a data collection. Analysis of data streams also introduces problems in data visualization and may require new approaches for representing massive, heterogeneous data streams.

Deriving knowledge from large data sets presents specific scaling problems due to the sheer number of items, dimensions, sources, users, and disparate user communities. The human ability to process visual information can augment analysis, especially when analytic results are presented in iterative and interactive ways. Visual analytics, the science of analytical reasoning enabled by interactive visual interfaces, can be used to synthesize information content and derive insight from massive, dynamic, ambiguous, and even conflicting data. Suitable, fully interactive visualizations help researchers absorb vast amounts of data directly, enhancing their ability to interpret and analyze otherwise overwhelming data. Researchers can thus detect the expected and discover the unexpected, uncovering hidden associations and deriving knowledge from information. As an added benefit, their insights are more easily and effectively communicated to others.

Creating and deploying visualization services requires new frameworks for distributed applications. In common with other cyberinfrastructure components, visualization requires easy-to-use, modular, extensible applications that capitalize on the reuse of existing technology. Today's successful analysis and visualization applications use a pipelined, component-based system on a single machine or across a small number of machines. Extending to the broader distributed, heterogeneous cyberinfrastructure system will require new interfaces and work on fundamental graphics and visualization algorithms that can run across remote and distributed settings.
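The pipelined, component-based pattern mentioned above can be illustrated with a small, hypothetical sketch in which independent components (read, filter, render) are chained into a visualization pipeline; the component names and data are invented for the example.

    # Hypothetical sketch of a pipelined, component-based visualization system:
    # independent stages are composed into a chain, and each stage can be
    # replaced or distributed without changing the others.
    def read_source(n):
        """Source component: stand-in for reading a data set."""
        return [(i, (i * 7) % 13) for i in range(n)]

    def threshold_filter(records, cutoff):
        """Filter component: keep only values above a cutoff."""
        return [(x, y) for x, y in records if y > cutoff]

    def render_ascii(records):
        """Render component: stand-in for an image or display stage."""
        return "\n".join(f"{x:3d} | {'#' * y}" for x, y in records)

    def run_pipeline(stages, data=None):
        """Drive the chain: the output of each stage feeds the next."""
        for stage in stages:
            data = stage(data) if data is not None else stage()
        return data

    pipeline = [
        lambda: read_source(10),
        lambda records: threshold_filter(records, cutoff=6),
        render_ascii,
    ]
    print(run_pipeline(pipeline))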

To address this range of needs for data tools and services, NSF will work with the broad community to identify and prioritize needs. In making investments, NSF will complement private sector efforts, for example those producing sophisticated indexing and search tools and packaging them as data services. NSF will support projects to conduct applied research and development of promising, interoperable