Scientific Data Preservation and Access Needs: Looking Forward
Brad [email protected]
School of Information and Library Science
University of North Carolina at Chapel Hill
Astronomy Data Growth
• From glass plates to CCDs
  – detectors follow Moore's law
• The result: a data tsunami
  – available data doubles every two years
• Telescope growth
  – 30X glass (concentration)
  – 3000X in pixels (resolution)
• Single images
  – 16K x 16K pixels
• Large Synoptic Survey Telescope
  – wide field imaging at 5 terabytes/night
Source: Alex Szalay/Jim Gray
Large Synoptic Survey Telescope (LSST)
• Top project of the astronomy decadal survey
• Celestial cinematography
  – 2 gigapixel detector for wide field imaging
• Science
  – beyond the standard model
    • non-baryonic dark matter
    • non-zero neutrino masses and neutrino oscillations
  – observation targets
    • near Earth object survey
    • weak lensing of wide fields
    • supernovae measurements
• Features
  – 7 square degree field / 6.9 meter effective aperture
  – > 5 TB of data/night from a mountain in Chile
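The >5 TB/night figure can be turned into an annual volume with a quick back-of-the-envelope check; the number of usable observing nights per year is an assumption here, not a figure from the slide:

```python
# Back-of-the-envelope LSST annual data volume.
# 5 TB/night is the figure from the slide; 300 observing
# nights per year is an assumed, illustrative value.
TB_PER_NIGHT = 5
NIGHTS_PER_YEAR = 300  # assumption, not an LSST specification

tb_per_year = TB_PER_NIGHT * NIGHTS_PER_YEAR
pb_per_year = tb_per_year / 1000  # decimal petabytes

print(f"{tb_per_year} TB/year = {pb_per_year} PB/year")  # 1500 TB/year = 1.5 PB/year
```

At that rate the archive crosses the petabyte scale within its first year of operation.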
Distributed Virtual Astronomy
• Capabilities
  – homogeneous, multi-wavelength data
  – observations of millions of objects
    • mega-sky surveys (2MASS, SLOAN, …)
• Initiatives
  – U.S. National Virtual Observatory (NVO)
    • Caltech, JHU, ALMA, HST, …
  – EU Astrophysical Virtual Observatory (AVO)
    • ESO, CNRS, CDS, …
• Grid data mining and archives
  – discovering significant patterns
    • analysis of rich image/catalog databases
  – understanding complex astrophysical systems
    • integrated data/large numerical simulations
HST Data Access
Medical Imaging Needs
• Many imaging modalities exist in medicine. Most are based on raster scanning (a pixel matrix represents a scanned image plane).
  – 2D slices (X-ray)
  – spatial series (volumes: CT, MRI, US)
  – time series (videos, heart studies, US)
• Radiology example: a multi-slice CT scanner. Each slice is 512x512 pixels, with up to a thousand slices in a study (~0.5 gigabytes per study).
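The ~0.5 GB/study figure follows directly from the stated dimensions, assuming a 16-bit (2 bytes) grayscale depth per pixel, which is typical for CT but not stated on the slide:

```python
# Storage for one multi-slice CT study (dimensions from the slide).
# 2 bytes/pixel (16-bit grayscale) is an assumed typical CT depth.
PIXELS_PER_SLICE = 512 * 512
SLICES = 1000          # "up to a thousand slices"
BYTES_PER_PIXEL = 2    # assumption, not from the slide

total_bytes = PIXELS_PER_SLICE * SLICES * BYTES_PER_PIXEL
print(f"{total_bytes / 1e9:.2f} GB per study")  # prints "0.52 GB per study"
```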
Genetics
• Genetic sequences have the simplest representation: a series of characters (text) representing nucleotide sequences.
• There are many more complex objects, including images…
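Because a nucleotide sequence is just text, ordinary string operations suffice for basic analyses. A minimal sketch of two common ones (the sequence itself is made up for illustration):

```python
# A nucleotide sequence as plain text; two common string-level
# operations. The sequence is an arbitrary illustrative example.
seq = "ATGCGTAACGTT"

# Fraction of G/C bases (GC content)
gc_content = (seq.count("G") + seq.count("C")) / len(seq)

# Reverse complement: complement each base, then reverse
complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
rev_comp = "".join(complement[base] for base in reversed(seq))

print(f"GC content: {gc_content:.2f}")    # prints "GC content: 0.42"
print(f"Reverse complement: {rev_comp}")  # prints "Reverse complement: AACGTTACGCAT"
```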
Data Heterogeneity and Complexity
[Diagram: a web of linked biological data types — disease, drug, clinical trial, phenotype, protein structure, protein sequence, protein-protein interactions, proteome, gene sequence, genome sequence, gene expression, homology]
Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bio-networks, alignments, disease, patterns and motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), …
Source: Carole Goble (Manchester)
Gene Expression and Microarrays
• Concurrent evaluation
  – expression levels for thousands of genes
• Photolithography
  – up to 500K 10-20 micron cells
    • each containing millions of identical DNA molecules
• Image capture and analysis
  – laser scanning and intensity calculation
Source: Affymetrix
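The intensity-calculation step can be sketched in miniature: each probe cell is a small block of pixels in the scanned image, summarized to a single signal value. The pixel values and the mean-intensity summary below are illustrative assumptions, not Affymetrix's actual algorithm:

```python
# Toy microarray intensity extraction: each row holds the scanned
# pixel intensities for one probe cell (values are made up).
cells = [
    [12, 14, 13],   # low-expression probe
    [90, 95, 92],   # high-expression probe
    [40, 42, 41],   # mid-expression probe
]

def cell_intensity(pixels):
    """Summarize one probe cell as its mean pixel intensity."""
    return sum(pixels) / len(pixels)

intensities = [cell_intensity(cell) for cell in cells]
print(intensities)
```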
Why is it important to capture?
• Previous research was documented in the scientific literature and books. Increasingly, though, our theories and methods are based on empirical measurements: data gathered or sampled from our environment. Without a preserved record of these data, we cannot verify previous work or build on existing work.
Memex: Still Prescient
“Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, “memex” will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”
Vannevar Bush
“As We May Think,” 1945
Human-Computer Symbiosis
"It seems reasonable to envision, for a time 10 or 15 years hence, a 'thinking center' that will incorporate the functions of present-day libraries together with anticipated advances in information storage and retrieval.
The picture readily enlarges itself into a network of such centers, connected to one another by wide-band communication lines and to individual users by leased-wire services. In such a system, the speed of the computers would be balanced, and the cost of the gigantic memories and the sophisticated programs would be divided by the number of users."
J.C.R. Licklider, 1960
21st Century Challenges
• The threefold way
  – theory and scholarship
  – experiment and measurement
  – computation and analysis
• Supported by
  – distributed, multidisciplinary teams
  – multimodal collaboration systems
  – distributed, large scale data sources
  – leading edge computing systems
  – distributed experimental facilities
• Socialization and community
  – multidisciplinary groups
  – geographic distribution
  – new enabling technologies
  – creation of 21st century IT infrastructure
    • sustainable, multidisciplinary communities
    • "Come as you are" response
[Diagram: triangle linking Theory, Experiment, and Computation]
Effect of Technology on Science
• Theories
• Computational models
• Real-world observations/measurements
Most science areas had theories, with limited sensor measurements.
With today's technology we can increasingly acquire mountains of sensor measurements, and we have computational models to check them against.
Example Changes…
• High energy physics: colliders, lasers
• Astronomy: large CCD-based telescopes, virtual arrays, space telescopes
• Medical imaging: multitudes of scanning techniques with increasing resolution; computational anatomical models
• Genetics: measurements of nucleotides, proteins, small molecules, etc.
Types of Challenges
• Technical
  – data storage, computing power, data ingest
• Knowledge (scholarly communications)
  – How do we share information (different terms, different languages)?
  – How do we preserve?
    • Medical imaging formats last 10-20 years (larger capital investment, clinical care).
    • Commercial sequencer data formats no longer exist two years after the product is introduced (rapid technology changes, research-only settings).
Technical Challenges: The Data Tsunami
• Many sources
  – agricultural
  – biomedical
  – environmental
  – engineering
  – manufacturing
  – financial
  – social and policy
  – historical
• Many causes and enablers
  – increased detector resolution
  – increased storage capability
• The challenge: extracting insight!
We Are Here!
Storage: in practical terms
• Megabyte
  – a small novel
• Gigabyte
  – a pickup truck filled with paper, or a DVD
• Terabyte: one thousand gigabytes
  – ~$1000 today
  – the text in one million books
  – the entire U.S. Library of Congress is ~ten terabytes of text
• Petabyte: one thousand terabytes
  – 1-2 petabytes equals all academic research library holdings
  – coming soon to a pocket near you!
  – soon routinely generated annually by many scientific instruments
• Exabyte: one thousand petabytes
  – 5 exabytes equals all words spoken in the history of humanity
• See www.sims.berkeley.edu/research/projects/how-much-info-2003/
Source: Hal Varian, UC-Berkeley
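These rules of thumb are easy to sanity-check against one another; the "one small novel per megabyte" equivalence is the slide's own approximation:

```python
# Sanity-checking the slide's storage rules of thumb
# (decimal SI units: 1 TB = 10**12 bytes).
BYTES = {"MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18}

# One small novel ~ 1 MB, so a terabyte holds about a million books
books_per_tb = BYTES["TB"] // BYTES["MB"]
print(f"books per terabyte: {books_per_tb:,}")  # 1,000,000

# Library of Congress text ~ 10 TB (the slide's estimate)
loc_per_pb = BYTES["PB"] // (10 * BYTES["TB"])
print(f"LoC-equivalents per petabyte: {loc_per_pb}")  # 100
```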
Knowledge Challenges: preservation requires standards
• Storage formats
  – media (CD-ROM, DVD, tapes)
  – file formats (PDF, JPEG, MPEG)
• Standards that define meaning for a particular domain (metadata, controlled vocabularies, taxonomies). Examples from the medical and biological sciences: MeSH, DICOM, MIAME, GO, caBIG.
Who pursues standards?
• Users, i.e. scientists (GO, multitudes of domain-specific examples)
• Manufacturers (companies making products)
  – storage media (CD-ROM, DVD; 2nd-generation DVD not quite yet)
  – knowledge standards are less frequently pursued, and more often in conjunction with a push from the user community (DICOM, MIAME)
• Government (MeSH, GenBank, caBIG)
Three Critical Steps to Success
• The scholarly communities must develop standards for communicating knowledge and for the long-term preservation of important descriptive information, e.g. taxonomies and controlled vocabularies.
• Common public repositories (which can be many centers federated as one logical repository) must be set up to store, preserve, and provide access (e.g. GenBank).
• Change behavior (bring about usage) by
  – mandates from funding agencies (NIH, NSF)
  – requirements for publication (GenBank for sequences)
What role does the government play?
• Has developed standards in areas where significant support was provided (medicine and science via NLM, e.g. MEDLINE, MeSH, UMLS, GenBank, Entrez), and continues to (NIH caBIG, etc.).
• Has been successful with high-cost shared resources (colliders (CERN), astronomy telescopes, etc.).
What role should the government play?
• Should government-funded grants require deposit of scientific data in repositories? Of research papers in public repositories (PubMed)?
• Should the government build and/or fund the other repositories and their maintenance?
• Should individual grants receive more money to support the publication, annotation, and deposit of research results?
• What about the many other areas not addressed? (Google, not the government, is digitizing the literature.)
PITAC Report Contents
• Computational Science: Ensuring America's Competitiveness
  1. A Wake-up Call: The Challenges to U.S. Preeminence and Competitiveness
  2. Medieval or Modern? Research and Education Structures for the 21st Century
  3. Multi-decade Roadmap for Computational Science
  4. Sustained Infrastructure for Discovery and Competitiveness
  5. Research and Development Challenges
• Two key appendices
  – Examples of Computational Science at Work
  – Computational Science Warnings: A Message Rarely Heeded
• Available at www.nitrd.gov
PITAC Recommendation
• The Federal government must implement coordinated, long-term computational science programs that include funding for interconnecting the software sustainability centers, national data and software repositories, and national high-end leadership centers with the researchers who use those resources, forming a balanced, coherent system that also includes regional and local resources.
• Such funding methods are customary practice in research communities that use scientific instruments such as light sources and telescopes, increasingly in data-centered communities such as those that use the genome database, and in the national defense sector.
Additional Challenges
In addition to the lack of an overall semantically interoperable framework for multidisciplinary research and public repositories…
• Motivation for research labs to adhere to standards, especially when storing and describing data (example UNC stories)
• Ownership, provenance
• Privacy and security (IRB approval for future research)
• Indexing, data mining
Challenges for Universities
• Multiple cultures
  – arts, humanities and social sciences
  – sciences and engineering
• Many scholarly communication approaches
  – books, monographs, journals, conferences
    • access time, priority and intellectual property
  – multiple media and expression
    • text, audio, video, artifacts, performances, …
  – primary and secondary source materials
  – professional societies and private publishers
• Institutional repositories
  – multiple visions and roles
    • digital archives and/or alternative publication venues
  – research and education
    • access modes and goals, not just articles or books
    • longitudinal access and lifelong learning
  – what and how much to save
    • declining cost of storage and simplicity of deposit
Computing History
• 1890-1945
  – mechanical, relay
  – 7 year doubling
• 1945-1985
  – tube, transistor, …
  – 2.3 year doubling
• 1985-2003
  – microprocessor
  – 1 year doubling
• Exponentials
  – chip transistor density: 2X in ~18 months
  – WAN bandwidth: 64X in two years
  – storage: 7X in two years
  – graphics: 100X in three years
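The exponentials listed above use different time bases; converting each "NX in T years" figure to an equivalent doubling time makes them directly comparable. A small sketch using the slide's own numbers:

```python
import math

def doubling_months(factor, years):
    """Months per doubling, given factor-times growth over `years`."""
    return 12 * years * math.log(2) / math.log(factor)

trends = [
    ("chip transistor density", 2, 1.5),   # 2X in ~18 months
    ("WAN bandwidth", 64, 2),              # 64X in two years
    ("storage", 7, 2),                     # 7X in two years
    ("graphics", 100, 3),                  # 100X in three years
]
for name, factor, years in trends:
    print(f"{name}: doubles every {doubling_months(factor, years):.1f} months")
```

By this measure bandwidth and graphics were doubling several times faster than transistor density over the period the slide covers.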
[Chart: operations per second per dollar, 1880-2000, on a log scale — doubled every 7.5 years in the mechanical/relay era, every 2.3 years in the tube/transistor era, and doubles every year since the microcomputer revolution; inset: a 4K-bit core plane. Source: Jim Gray]
Computing Power Trends
http://www.transhumanist.com/volume1/moravec.htm
Example: Linking Genotype and Phenotype to Study Diseases
[Diagram: identified genes map to phenotypes 1-4 and feed predictive disease susceptibility, linked through physiology, metabolism, endocrine, proteome, immune, transcriptome, biomarker signatures, morphometrics, and pharmacokinetics, modulated by ethnicity, environment, age, and gender]
Source: Terry Magnuson