17
National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing Large Environmental Data Sets Dan Crichton, Program Manager, Earth and Planetary Science Data Systems Amy Braverman, Senior Statistician NASA/JPL

National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

Embed Size (px)

Citation preview

Page 1: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Challenges of Analyzing Large Environmental Data Sets

Dan Crichton, Program Manager, Earth and Planetary Science Data Systems

Amy Braverman, Senior Statistician

NASA/JPL

Page 2: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Massive Data Sets in the Environmental Sciences

• Environmental science areas (not exhaustive):– Climate change science/climate modeling

• Global• Regional

– Environmental quality • Pollution• Epidemiology• Land use and natural resource management

– Decision support and disaster management• Climate change impacts • Policy decisions and treaty enforcement• Disaster response (flooding, drought, volcanoes, etc.)

Page 3: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Massive Data Sets in Climate Science

• Climate model output:

– originally intended as laboratory experiments to play “what if” (explore the physics by twiddling knobs and seeing what happens)

– now have greater policy implications wrt predictions into the future, attribution of causes, and characterizing uncertainties

• Observations:

– Improve process understanding and formulate hypotheses through exploratory data analysis

– Improve parameterizations (statistical description of sub-grid-scale processes)

– Establishment of long term data records

– Model evaluation

• comparison of model output against observations

• weighting multi-model ensemble members

Page 4: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Architecture Drivers: Data Intensive Science

• Increasing data volumes requiring new approaches for data production, validation, processing, discovery and data transfer/distribution (E.g., scalability relative to available resources)

– Roughly doubling in size every two years– Shift from compute to data intensive

• Increased emphasis on usability of the data (E.g., discovery, access and analysis)

• Increasing diversity of data sets and complexity for integrating across missions/experiments (E.g., common information model for describing the data)

– the benefits to science in bringing together and creating “fused” data products from multiple sources is critical in areas such as climatology where baseline data records are needed across measurements **

• Increasing distribution of coordinated processing, operations and analysis (E.g., federation)

• On the fly analysis

• Increased pressure to reduce cost of supporting new missions

• Increasing desire for PIs to have integrated tool sets to work with data products with their own environments (E.g. perform their own generation and distribution)

Cumulative Volume of L2+ Products at All DAACs

0

500

1,000

1,500

2,000

2,500

3,000

3,500

4,000

FY00 FY01 FY02 FY03 FY04 FY05 FY06 FY07 FY08 FY09 FY10 FY11 FY12 FY13 FY14

Fiscal Year

Cumulative Volume (TB)

Page 5: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

NASA Earth Science Data Pipeline

Data Acquisition

and Command

Instrument

Operations

EDOS/GDS

L0A Processin

g

ScienceData

ProcessingL0BL1L2L3L4

SDS EOSDIS DAAC

ScienceData

Management Archive

&Distribution

Instrument

Operations

EDOS/GDS

L0A Processin

g

ScienceData

ProcessingL0BL1L2L3L4

SDS EOSDIS DAAC

ScienceData

Management Archive

&Distribution

EOSDIS Data Centers

ScienceData

Management Archive

&Distribution

ScienceData

ProcessingL0BL1L2L3L4

Science Data Systems

Instrument Operations

EDOS/Ground Data

Systems

L0A Processing

Science Teams Outreach Research

Mission Operation

s

Mission Operations

TDRS Network

On Board Processing

Page 6: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

EOSDIS DAAC’sEarth Observing System Data and Information

System Distributed Active Archive Centers

Page 7: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Using Satellite Observations to Enable Climate Model Evaluation

How to bring as much observational scrutiny as

possible to the IPCC process?

How to best utilize the wealth of NASA Earth observations

for the IPCC process?

Next Target : IPCC AR5Model Output Available for Analysis Spring 2011

Papers Due ~ Late 2011/Early 2012Report Completion 2013

Page 8: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Earth System Grid Federation

• DOE-funded federation to distribute climate model output to the climate modeling community• Common services for access to repositories and portals/gateways • Highly decoupled• Open source framework (software packaged and distributed) mandated by DOE SciDAC Program• A Recent question….how do you link observations and climate model output?

Page 9: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

ESG – NASA Integration

Page 10: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Moving to Data Intensive Science

• Traditional Pipelines vs. Online Dynamic Services

– Convergence between static pipelines and “on-the-fly” data processing and services

• Analysis of Distributed Data through Distributed Computational Services

– Push computational services to data

• Fused Data Products

– Generate new, fused data products

• Virtual Research Networks

– Provide a computing infrastructure for collaborative research

Page 11: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Traditional Analysis Approach

• User program must encode all functionality beyond gross-level access.

• Requires knowledge of specific instrument characteristics such as retrieval methods, format, measurement error characteristics and biases, etc.

• Difficulties multiply with more than one data source.

Credit: Braverman, Mattmann, Crichton

Page 12: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Emerging Paradigm for Analysis

• Push as much computation as possible to locations where the data reside; minimize data movement

• Deploy simple services to data centers that provide access and the computational functions to enable model-to-data analysis– Embrace service-oriented style of architecture

Credit: Braverman, Mattmann, Crichton

Page 13: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Data Integration

Combining AIRS and MLS requires:

– Rectifying horizontal, vertical and temporal mismatch

– Assessing and correcting for the instruments’ scene-specific error characteristics (see left diagram)

Page 14: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Model Intercomparison: Regional Example

Collect User Choices (GUI / command line)Collect User Choices (GUI / command line)

Load model data

Load model data

Retrieve obs from databaseRetrieve obs

from database

Spatial re-gridding onto common grid

Spatial re-gridding onto common grid

Time averagingTime averaging

Area -averagingArea -averaging

Annual cycle compositingAnnual cycle compositing

Metric CalculationMetric Calculation

Plot productionPlot production

Model file

Model file

RC

ME

T

optio

nal

e.g. calculate monthly means from daily data

e.g. calculate monthly means from daily data

e.g. calculate area-weighted

mean over user defined masked

region

e.g. calculate area-weighted

mean over user defined masked

region

e.g. calculate means of all Januarys, all Februarys etc

e.g. calculate means of all Januarys, all Februarys etc

e.g. calculate bias, RMS error etc

e.g. calculate bias, RMS error etc

e.g. map, time series plot, Taylor

diagram

e.g. map, time series plot, Taylor

diagram

ObservationsObservations

Page 15: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Computational Vision

Data Acquisition

and Command

Instrument

Operations

EDOS/GDS

L0A Processin

g

Instrument

Operations

EDOS/GDS

L0A Processin

g

Instrument Operations

EDOS/Ground Data

Systems

L0A Processing

Mission Operation

sMission

Operations

TDRS Network

On Board Processing

Applications

Analysis, Modeling and

ApplicationEnvironments/

Gateways

Other Data Systems (e.g.

NOAA)

Other Data Systems (e.g.

NOAA)

Other Data Systems (e.g.

NOAA)

Other Data Systems (e.g.

NOAA)

Other Data Systems

(e.g. NOAA)

Other Data Systems

(e.g. NOAA)

DecisionSupport

ScienceData

Processing

ScienceData

Manage

NASA Mission/Multi-Mission Data & Science Centers

ScienceData

Manage

NASA Mission/Multi-Mission Data & Science Centers

ScienceData

Manage

NASA Mission/Multi-Mission Data & Science Centers

Research

Science Teams

Page 16: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Research Challenges for Statistics

• What architectural design produces the most efficient system topology for the types of data movement that will be required given scientific objectives? Can we study this as an optimization problem?

• How do we design computational methods that exploit the system topology and its distributed nature? Need algorithms that operate on distributed data to produce statistics of interest, or approximations. Study this trade-off.

• Data analysis choreography: how to assemble algorithms most efficiently given a set of analysis goals? How to optimize the movement of data?

• How can statistics and other disciplines (e.g., computer science) education be better aligned?

Page 17: National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing

National Aeronautics and Space Administration

Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California

Summary

• Significant efficiencies may be achieved by thinking of data analysis and data access together rather than thinking of them as serial operations.

• In this paradigm, data sets are not static entities. They are virtual, possibly streaming data structures flowing across the internet, manipulated and combined on-the-fly as necessary for specific analyses.

• We need new statistical methods and algorithms optimized for this type of environment.