
Scientific Data Preservation and Access Needs: Looking Forward

Brad Hemminger

School of Information and Library Science

University of North Carolina at Chapel Hill

Three stories…

• Astronomy

• Medical Imaging

• Genetics

Astronomy Data Growth

• From glass plates to CCDs
  – detectors follow Moore's law

• The result: a data tsunami
  – available data doubles every two years

• Telescope growth
  – 30X glass (concentration)
  – 3000X in pixels (resolution)

• Single images
  – 16K x 16K pixels

• Large Synoptic Survey Telescope
  – wide-field imaging at 5 terabytes/night

Source: Alex Szalay/Jim Gray

Large Synoptic Survey Telescope (LSST)

• Top project of the astronomy decadal survey

• Celestial cinematography
  – 2-gigapixel detector for wide-field imaging

• Science
  – beyond the standard model
    • non-baryonic dark matter
    • non-zero neutrino masses and neutrino oscillations
  – observation targets
    • near-Earth object survey
    • weak lensing of wide fields
    • supernovae measurements

• Features
  – 7 square degree field / 6.9 meter effective aperture
  – > 5 TB of data/night from a mountain in Chile
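To make the 5 TB/night figure concrete, here is a minimal back-of-the-envelope sketch in Python. Only the 2-gigapixel detector and the roughly 5 TB/night total come from the slides; the 2-byte pixel depth and the exposure count per night are illustrative assumptions, not LSST specifications.

    # Back-of-the-envelope estimate of LSST nightly data volume.
    # Assumptions (illustrative only): 16-bit (2-byte) pixels and
    # roughly 1,250 wide-field exposures per night.
    GIGAPIXELS_PER_EXPOSURE = 2      # 2-gigapixel detector (from the slide)
    BYTES_PER_PIXEL = 2              # assumed 16-bit raw pixels
    EXPOSURES_PER_NIGHT = 1_250      # assumed cadence

    bytes_per_exposure = GIGAPIXELS_PER_EXPOSURE * 1e9 * BYTES_PER_PIXEL
    bytes_per_night = bytes_per_exposure * EXPOSURES_PER_NIGHT

    print(f"{bytes_per_exposure / 1e9:.0f} GB per exposure")   # ~4 GB
    print(f"{bytes_per_night / 1e12:.1f} TB per night")        # ~5 TB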

Distributed Virtual Astronomy

• Capabilities
  – homogeneous, multi-wavelength data
  – observations of millions of objects
    • mega-sky surveys (2MASS, SLOAN, …)

• Initiatives
  – U.S. National Virtual Observatory (NVO)
    • Caltech, JHU, ALMA, HST, …
  – EU Astrophysical Virtual Observatory (AVO)
    • ESO, CNRS, CDS, …

• Grid data mining and archives
  – discovering significant patterns
    • analysis of rich image/catalog databases
  – understanding complex astrophysical systems
    • integrated data / large numerical simulations

HST Data Access

Biomedical Imaging Challenges

Source: Chris Johnson, Utah and Art Toga, UCLA

Medical Imaging Needs

• Many imaging modalities exist in medicine. Most are based on raster scanning (a pixel matrix represents a scanned image plane).
  – 2D slices (X-ray)
  – spatial series (volumes: CT, MRI, US)
  – time series (videos, heart studies, US)

• Radiology example: a multi-slice CT scanner. Each slice is 512x512 pixels, with up to a thousand slices in a study (~0.5 gigabytes per study).
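The 0.5 GB figure follows directly from those slice dimensions, assuming 16-bit (2-byte) pixels, which is typical for CT but not stated above. A minimal sketch of the arithmetic:

    # Size of a multi-slice CT study: 512 x 512 pixels per slice,
    # up to 1,000 slices, assuming 2 bytes (16 bits) per pixel.
    ROWS, COLS = 512, 512
    BYTES_PER_PIXEL = 2      # assumption: 16-bit pixel depth
    SLICES = 1_000

    study_bytes = ROWS * COLS * BYTES_PER_PIXEL * SLICES
    print(f"{study_bytes / 1e9:.2f} GB per study")   # ~0.52 GB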

Genetics

• Genetic sequences have the simplest representation: a series of characters (text) representing a nucleotide sequence (see the sketch below).

• There are many more complex objects, including images…
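As a minimal illustration of that text representation, the sketch below stores a made-up nucleotide fragment as a plain character string and computes two simple properties; the FASTA-style header and the sequence itself are hypothetical.

    # A nucleotide sequence is just text: one character per base (A, C, G, T).
    # The record below is hypothetical, written in FASTA-like form.
    record = """>example_sequence hypothetical fragment
    ATGGCGTACGTTAGCCTAGGCATCGATCGATTACG"""

    header, sequence = record.split("\n", 1)
    sequence = sequence.strip()

    length = len(sequence)
    gc_content = (sequence.count("G") + sequence.count("C")) / length

    print(header)
    print(f"length = {length} bases, GC content = {gc_content:.0%}")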

Data Heterogeneity and Complexity

[Diagram: a web of interrelated biological data types: disease, drug, clinical trial, phenotype, protein, protein structure, protein sequence, protein-protein interactions, proteome, gene sequence, genome sequence, gene expression, and homology.]

Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bio-networks, alignments, disease, patterns and motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), …

Source: Carole Goble (Manchester)

Gene Expression and Microarrays

• Concurrent evaluation
  – expression levels for thousands of genes

• Photolithography
  – up to 500K 10-20 micron cells
    • each containing millions of identical DNA molecules

• Image capture and analysis
  – laser scanning and intensity calculation (a minimal sketch follows below)

Source: Affymetrix
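The "intensity calculation" step can be pictured as averaging scanner pixel values over each probe cell. The sketch below is a generic illustration of that idea, not Affymetrix's actual processing pipeline; the image, cell size, and coordinates are all made up.

    import numpy as np

    # Minimal sketch of per-cell intensity calculation from a scanned
    # microarray image (not the vendor's actual algorithm).
    rng = np.random.default_rng(0)
    image = rng.integers(0, 65_535, size=(1_024, 1_024))   # stand-in for a scan

    CELL = 16  # assumed cell size in pixels

    def cell_intensity(img, row, col, cell=CELL):
        """Mean pixel intensity of the cell whose top-left corner is (row, col)."""
        return img[row:row + cell, col:col + cell].mean()

    # Intensity for one hypothetical probe cell.
    print(cell_intensity(image, row=512, col=256))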

Quantitative Begets Qualitative Change

Why is it important to capture?

• Previous research was documented in the scientific literature and in books. Increasingly, though, our theories and methods are based on empirical measurements: data gathered or sampled from our environment. Without a preserved record of these data, we cannot verify previous work or build on existing work.

Memex: Still Prescient

“Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, “memex” will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”

Vannevar Bush

“As We May Think,” 1945

Human-Computer Symbiosis

"It seems reasonable to envision, for a time 10 or 15 years hence, a 'thinking center' that will incorporate the functions of present-day libraries together with anticipated advances in information storage and retrieval.

The picture readily enlarges itself into a network of such centers, connected to one another by wide-band communication lines and to individual users by leased-wire services. In such a system, the speed of the computers would be balanced, and the cost of the gigantic memories and the sophisticated programs would be divided by the number of users."

J.C.R. Licklider, 1960

21st Century Challenges

• The three-fold way
  – theory and scholarship
  – experiment and measurement
  – computation and analysis

• Supported by
  – distributed, multidisciplinary teams
  – multimodal collaboration systems
  – distributed, large-scale data sources
  – leading-edge computing systems
  – distributed experimental facilities

• Socialization and community
  – multidisciplinary groups
  – geographic distribution
  – new enabling technologies
  – creation of 21st century IT infrastructure
    • sustainable, multidisciplinary communities
    • "Come as you are" response

[Diagram: triangle linking Theory, Experiment, and Computation.]

Effect of Technology on Science

• Theories
• Computational models
• Real-world observations/measurements

Historically, most science areas had theories but only limited sensor measurements.

With today's technology we can increasingly acquire mountains of sensor measurements, and we have computational models to check them against.

Example Changes…

• High Energy Physics: colliders, lasers

• Astronomy: large CCD-based telescopes, virtual arrays, space telescopes

• Medical Imaging: a multitude of scanning techniques with increasing resolution; computational anatomical models

• Genetics: measurements of nucleotides, proteins, small molecules, etc.

Types of Challenges

• Technical
  – data storage, computing power, data ingest

• Knowledge (scholarly communications)
  – How do we share information (different terms, different languages)?
  – How do we preserve it?
    • Medical imaging formats last 10-20 years (larger capital investment, clinical care).
    • A commercial sequencer's data format no longer exists two years after the product was introduced (rapid technology changes, research-only settings).

Technical Challenges: The Data Tsunami

• Many sources
  – agricultural
  – biomedical
  – environmental
  – engineering
  – manufacturing
  – financial
  – social and policy
  – historical

• Many causes and enablers
  – increased detector resolution
  – increased storage capability

• The challenge: extracting insight!

We Are Here!

Source: Robert Morris, IBM

Sensor Data Overload

Storage: Qualitative Change

[Images: storage hardware from 1956 (5 MB), 1972, and 2004 (80 GB).]

Storage: in practical terms

• Megabyte
  – a small novel

• Gigabyte
  – a pickup truck filled with paper, or a DVD

• Terabyte: one thousand gigabytes
  – ~$1000 today
  – the text in one million books
  – the entire U.S. Library of Congress is ~ten terabytes of text (see the sketch below)

• Petabyte: one thousand terabytes
  – 1-2 petabytes equals all academic research library holdings
  – coming soon to a pocket near you!
  – soon routinely generated annually by many scientific instruments

• Exabyte: one thousand petabytes
  – 5 exabytes of words spoken in the history of humanity

• See www.sims.berkeley.edu/research/projects/how-much-info-2003/

Source: Hal Varian, UC-Berkeley
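The book-based comparisons above rest on simple arithmetic. The sketch below reproduces them, assuming roughly 1 MB of plain text per book, a common rule of thumb rather than a figure from the slide:

    # Rough arithmetic behind the storage comparisons, assuming ~1 MB of
    # plain text per book (an assumption, not a measured figure).
    MB_PER_BOOK = 1
    TERABYTE_MB = 1_000_000
    LIBRARY_OF_CONGRESS_TB = 10          # ~10 TB of text (from the slide)

    books_per_terabyte = TERABYTE_MB // MB_PER_BOOK
    loc_books = LIBRARY_OF_CONGRESS_TB * books_per_terabyte

    print(f"~{books_per_terabyte:,} books per terabyte")      # ~1,000,000
    print(f"~{loc_books:,} book-equivalents of LoC text")     # ~10,000,000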

Knowledge Challenges: preservation requires standards

• Storage formats
  – media (CD-ROM, DVD, tapes)
  – file formats (PDF, JPEG, MPEG)

• Standards that define meaning for a particular domain (metadata, controlled vocabularies, taxonomies). Examples from the medical and biological sciences: MeSH, DICOM, MIAME, GO, caBIG.
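As a concrete example of what such a standard buys, the sketch below reads a few standardized DICOM tags using the pydicom library (assuming it is installed); the file path is hypothetical and error handling is omitted.

    import pydicom

    # Read standardized metadata from a DICOM file (path is hypothetical).
    ds = pydicom.dcmread("study/slice_0001.dcm")

    # DICOM defines these fields, so any conforming reader can interpret them.
    print("Modality:       ", ds.Modality)
    print("Study date:     ", ds.StudyDate)
    print("Rows x Columns: ", ds.Rows, "x", ds.Columns)
    print("Pixel spacing:  ", ds.PixelSpacing)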

Who pursues standards?

• Users, i.e., scientists (GO, and a multitude of domain-specific examples)

• Manufacturers (companies making products)
  – storage media (CD-ROM, DVD; 2nd-generation DVD not quite yet)
  – knowledge standards are pursued less often, and more often in conjunction with a push from the user community (DICOM, MIAME)

• Government (MeSH, GenBank, caBIG)

Three Critical Steps to Success

• The scholarly communities must develop standards for communicating knowledge and for the long-term preservation of important descriptive information, e.g. taxonomies and controlled vocabularies.

• Common public repositories (which can be many centers federated as one logical repository) must be set up to store, preserve, and provide access to data (GenBank), as illustrated in the sketch below.

• Change behavior (bring about usage) by
  – mandates from funding agencies (NIH, NSF)
  – requirements for publication (GenBank for sequences)
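As an illustration of what a common public repository enables, the sketch below retrieves a GenBank record through NCBI's Entrez interface using Biopython; the e-mail address is a placeholder and the accession number is just an example.

    from Bio import Entrez

    # Fetch a GenBank record from NCBI Entrez via Biopython.
    # The e-mail address is a placeholder; the accession is an example record.
    Entrez.email = "your.name@example.org"

    handle = Entrez.efetch(db="nucleotide", id="NM_000518",
                           rettype="gb", retmode="text")
    record_text = handle.read()
    handle.close()

    print(record_text[:500])   # first few hundred characters of the record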

What role does the government play?

• Has developed standards in areas where significant support was provided (medicine and science via NLM, e.g. Medline, MeSH, UMLS, GenBank, Entrez, etc.), and continues to (NIH caBIG, etc.).

• Successful with high-cost shared resources (colliders (CERN), astronomy telescopes, etc.).

What role should the government play?

• Should government-funded grants require deposit of scientific data in repositories? Of research papers in public repositories (PubMed)?

• Should the government build and/or fund the other repositories and their maintenance?

• Should individual grants receive more money to support the publication, annotation and deposit of research results?

• What about the many other areas not addressed (Google, not the government, is digitizing literature)?

PITAC Report Contents

• Computational Science: Ensuring America's Competitiveness
  1. A Wake-up Call: The Challenges to U.S. Preeminence and Competitiveness
  2. Medieval or Modern? Research and Education Structures for the 21st Century
  3. Multi-decade Roadmap for Computational Science
  4. Sustained Infrastructure for Discovery and Competitiveness
  5. Research and Development Challenges

• Two key appendices
  – Examples of Computational Science at Work
  – Computational Science Warnings – A Message Rarely Heeded

• Available at www.nitrd.gov

PITAC Recommendation

• The Federal government must implement coordinated, long-term computational science programs that include funding for interconnecting the software sustainability centers, national data and software repositories, and national high-end leadership centers with the researchers who use those resources, forming a balanced, coherent system that also includes regional and local resources.

• Such funding methods are customary practice in research communities that use scientific instruments such as light sources and telescopes, increasingly in data-centered communities such as those that use the genome database, and in the national defense sector.

Additional Challenges

In addition to the lack of an overall semantically interoperable framework for multidisciplinary research and public repositories…

• Motivation for research labs to adhere to standards, especially when storing and describing data (example: UNC stories)

• Ownership, provenance

• Privacy and security (IRB approval for future research)

• Indexing, data mining

Challenges for Universities

• Multiple cultures
  – arts, humanities, and social sciences
  – sciences and engineering

• Many scholarly communication approaches
  – books, monographs, journals, conferences
    • access time, priority, and intellectual property
  – multiple media and expression
    • text, audio, video, artifacts, performances, …
  – primary and secondary source materials
  – professional societies and private publishers

• Institutional repositories
  – multiple visions and roles
    • digital archives and/or alternative publication venues
  – research and education
    • access modes and goals, not just articles or books
    • longitudinal access and lifelong learning
  – what and how much to save
    • declining cost of storage and simplicity of deposit

Computing History

• 1890-1945
  – mechanical, relay
  – 7-year doubling

• 1945-1985
  – tube, transistor, …
  – 2.3-year doubling

• 1985-2003
  – microprocessor
  – 1-year doubling

• Exponentials
  – chip transistor density: 2X in ~18 months
  – WAN bandwidth: 64X in two years
  – storage: 7X in two years
  – graphics: 100X in three years
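To compare those rates on a common footing, here is a small sketch that converts each figure above into an annualized growth factor; the input numbers come from the slide, and the conversion is ordinary arithmetic.

    # Convert the growth figures above to annualized growth factors.
    def annual_factor(total_factor, years):
        """Growth factor per year given total growth over a period."""
        return total_factor ** (1 / years)

    trends = {
        "transistor density (2x / 1.5 yr)": annual_factor(2, 1.5),
        "WAN bandwidth (64x / 2 yr)":       annual_factor(64, 2),
        "storage (7x / 2 yr)":              annual_factor(7, 2),
        "graphics (100x / 3 yr)":           annual_factor(100, 3),
    }

    for name, factor in trends.items():
        print(f"{name}: ~{factor:.1f}x per year")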

[Chart: operations per second per dollar, 1880-2000; doubling every 7.5 years, then every 2.3 years, then every year. Annotations mark the 4K-bit core plane and the microcomputer revolution. Source: Jim Gray]

Computing Power Trends

http://www.transhumanist.com/volume1/moravec.htm

Example: Linking Genotype and Phenotype to Study Diseases

[Diagram: identifying genes that link phenotypes (Phenotype 1-4) to predictive disease susceptibility, drawing on physiology, metabolism, endocrine and immune measures, the proteome and transcriptome, biomarker signatures, morphometrics, pharmacokinetics, ethnicity, environment, age, and gender.]

Source: Terry Magnuson