Mining Climate Data © M. Steinbach Mining Climate Data Michael Steinbach Vipin Kumar University of Minnesota /AHPCRC Collaborators: G. Karypis, S. Shekhar (University of Minnesota/AHPCRC) V. Chadola, S. Iyer, G. Simon, P. Zhang (UM/AHPCRC) P. N. Tan (Michigan State University) C. Potter (NASA Ames Research Center), S. Klooster (California State University, Monterey Bay). NASA funded project: Discovery of Changes from the Global Carbon Cycle and Climate System Using Data Mining Additional support from Army High Performance Computing Research Center Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute.

Page 1: University of Minnesota /AHPCRC

Mining Climate Data

© M. Steinbach Mining Climate Data

Michael Steinbach, Vipin Kumar

University of Minnesota /AHPCRC

Collaborators:
G. Karypis, S. Shekhar (University of Minnesota/AHPCRC)
V. Chadola, S. Iyer, G. Simon, P. Zhang (UM/AHPCRC)
P. N. Tan (Michigan State University)
C. Potter (NASA Ames Research Center), S. Klooster (California State University, Monterey Bay).

NASA funded project: Discovery of Changes from the Global Carbon Cycle and Climate System Using Data Mining

Additional support from Army High Performance Computing Research Center

Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute.

Page 2: University of Minnesota /AHPCRC

Overview

Background

Data Mining Tasks
  – Detection of Disturbances and Associations
  – Discovery of Climate Indices

Distributed Issues

Conclusion

Page 3: University of Minnesota /AHPCRC

Research Goal

Global snapshots of values for a number of variables on land surfaces or water.

Span a range of 10 to 50 years.

Gridded data.

[Figure: stacked global grids, one per variable (average monthly temperature, SST, precipitation, NPP, pressure, ...), with axes longitude, latitude, and time; labels mark a grid cell and a zone]

A key interest is finding connections between the ocean / atmosphere and the land.

Research Goal: Find global climate patterns of interest to Earth Scientists

Page 4: University of Minnesota /AHPCRC

The El Nino Climate Phenomenon

El Nino is the anomalous warming of the eastern tropical region of the Pacific.

http://www.usatoday.com/weather/tg/wetnino/wetnino.htm

Normal Year: Trade winds push warm ocean water west, cool water rises in its place

El Nino Year: Trade winds ease, switch direction, warmest water moves east.

Page 5: University of Minnesota /AHPCRC

Overview

Background

Data Mining Tasks
  – Detection of Disturbances and Associations
  – Discovery of Climate Indices

Distributed Issues

Conclusion

Page 6: University of Minnesota /AHPCRC

Detection of Ecosystem Disturbances

Ecosystem disturbances can be detected as sudden changes in “greenness” in satellite data.
  – FPAR: Fraction of Photosynthetically Active Radiation absorbed by the green part of vegetation.
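As a rough illustration of what "detecting a sudden change in greenness" could look like, the sketch below flags months in which an FPAR series falls far below the mean of the preceding 12 months. The window length, the threshold, the function name `find_sudden_drops`, and the sample values are all assumptions made for illustration; the slides do not specify the detection method actually used.

```python
from statistics import mean, pstdev

def find_sudden_drops(fpar, window=12, n_sigmas=3.0):
    """Return indices where fpar falls more than n_sigmas standard
    deviations below the mean of the preceding `window` values.

    Hypothetical helper for illustration only; not the method used
    in the study.
    """
    drops = []
    for i in range(window, len(fpar)):
        history = fpar[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        # Skip a perfectly flat history, where sigma is zero.
        if sigma > 0 and fpar[i] < mu - n_sigmas * sigma:
            drops.append(i)
    return drops

# Synthetic monthly FPAR values (made up): stable, then a sharp drop.
series = [0.60, 0.62, 0.61, 0.59, 0.60, 0.63,
          0.61, 0.60, 0.62, 0.61, 0.60, 0.59,
          0.61, 0.25, 0.27, 0.30]
print(find_sudden_drops(series))
```

A trailing window is only one possible baseline; a real analysis would probably also need to remove the seasonal cycle before looking for drops.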

Page 7: University of Minnesota /AHPCRC

Detection of Ecosystem Disturbances

Release: 03-51AR

NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS

NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years.

http://amesnews.arc.nasa.gov/releases/2003/03_51AR.html

Major ecosystem disturbances detected in North America.

NASA image of patterns in the 18-year record (1982-1999) of global satellite observations of vegetation greenness from the Advanced Very High Resolution Radiometer (AVHRR). Different colored areas identify the major ecosystem disturbance events detected and the year they occurred. The majority of potential disturbance events pictured occurred in boreal forest ecosystems of Canada or shrublands and rangelands of the southern United States.

Smoke over Borneo, Indonesia

Page 8: University of Minnesota /AHPCRC

Mining Associations in Earth Science Data: Challenges

Transaction Id   Items
1                Bread, Milk
2                Beer, Diaper, Bread, Eggs
3                Beer, Coke, Diaper, Milk
4                Beer, Bread, Diaper, Milk
5                Coke, Bread, Diaper, Milk

Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}

How to transform Earth Science data into transactions?
  – What are the “baskets”?
  – What are the “items”?
  – How to define “support”?
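To make "support" concrete for the market-basket table above, here is a minimal sketch that computes support and confidence for the two rules shown. It uses the standard definitions (support = fraction of transactions containing both sides of the rule, confidence = that count divided by the count for the left-hand side); the function names are illustrative.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Diaper", "Bread", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Bread", "Diaper", "Milk"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent --> consequent."""
    both = support_count(antecedent | consequent, transactions)
    ante = support_count(antecedent, transactions)
    support = both / len(transactions)
    confidence = both / ante if ante else 0.0
    return support, confidence

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    s, c = rule_stats(lhs, rhs, transactions)
    print(sorted(lhs), "-->", sorted(rhs), f"support={s:.2f} confidence={c:.2f}")
```

For these five transactions both rules have support 2/5; the confidences are 2/4 and 2/3 respectively.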

Page 9: University of Minnesota /AHPCRC

Mining Association Patterns in Earth Science Data: Challenges

How to efficiently discover spatio-temporal associations?
  – Use existing algorithms.
  – Develop new algorithms.

(Lat, Long, time)   Events
(10N, 10E, 1)       {Temp-Hi, Prec-Lo}
(10N, 10E, 2)       {Temp-Hi, Prec-Lo, NPP-Lo}
(10N, 11E, 2)       {Temp-Hi, NPP-Lo}
(10N, 11E, 5)       {Solar-Hi, NPP-Lo}
(10N, 11E, 10)      {Prec-Hi, PET-Lo}

1  FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI (support count=145, confidence=100%)
2  FPAR-HI PET-HI PREC-HI TEMP-HI ==> NPP-HI (support count=933, confidence=99.3%)
3  FPAR-HI PET-HI PREC-HI ==> NPP-HI (support count=1655, confidence=98.8%)
4  FPAR-HI PET-HI PREC-HI SOLAR-HI ==> NPP-HI (support count=268, confidence=98.2%)

How to identify interesting patterns?
  – Use objective interest measures.
  – Use domain knowledge.
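One way to read the (Lat, Long, time) table above is that each grid cell and time step is a basket and each discretized variable is an item. The sketch below shows that transformation under stated assumptions: the Hi/Lo cut-offs, the raw values, and the helper names are hypothetical, chosen only so that the resulting baskets match the first three rows of the table.

```python
def to_events(record, thresholds):
    """Map raw variable values to Hi/Lo event items.

    thresholds maps a variable name to a (low, high) pair; values at or
    below `low` yield "<var>-Lo", values at or above `high` yield
    "<var>-Hi". The cut-offs are placeholders, not values from the study.
    """
    items = set()
    for var, value in record.items():
        low, high = thresholds[var]
        if value <= low:
            items.add(f"{var}-Lo")
        elif value >= high:
            items.add(f"{var}-Hi")
    return items

# Hypothetical cut-offs and observations, keyed by (lat, long, time).
thresholds = {"Temp": (5.0, 25.0), "Prec": (20.0, 150.0), "NPP": (10.0, 60.0)}
observations = {
    ("10N", "10E", 1): {"Temp": 30.0, "Prec": 12.0, "NPP": 35.0},
    ("10N", "10E", 2): {"Temp": 31.0, "Prec": 8.0, "NPP": 6.0},
    ("10N", "11E", 2): {"Temp": 29.0, "Prec": 80.0, "NPP": 4.0},
}

# One transaction (basket) per grid cell and time step.
transactions = {key: to_events(rec, thresholds) for key, rec in observations.items()}

def support_count(itemset, transactions):
    """Count the baskets that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

print(transactions)
print(support_count({"Temp-Hi", "NPP-Lo"}, transactions))
```

Other choices are possible, for example treating a whole region or a whole year as one basket, and each choice changes what "support" means.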

Page 10: University of Minnesota /AHPCRC

Example of Interesting Association Patterns

FPAR-Hi ==> NPP-Hi (sup=5.9%, conf=55.7%)

Shrubland areas

Rule has high support in shrubland areas

Page 11: University of Minnesota /AHPCRC

Overview

Background

Data Mining Tasks
  – Detection of Disturbances and Associations
  – Discovery of Climate Indices

Distributed Issues

Conclusion

Page 12: University of Minnesota /AHPCRC

Climate Indices: Connecting the Ocean/Atmosphere and the Land

A climate index is a time series of temperature or pressure
  – Similar to business or economic indices
  – Based on Sea Surface Temperature (SST) or Sea Level Pressure (SLP)

Climate indices are important because
  – They distill climate variability at a regional or global scale into a single time series.
  – They are well-accepted by Earth scientists.
  – They are related to well-known climate phenomena such as El Niño.

[Figure: Dow Jones Index (from Yahoo)]

Page 13: University of Minnesota /AHPCRC

A Temperature Based Climate Index: NINO1+2

[Figure: Nino 1+2 Index time series with El Nino events marked]

[Figure: Correlation Between Nino 1+2 and Land Temperature (>0.2), mapped by longitude and latitude; color scale from -0.8 to 0.8]
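The correlation map above can be thought of as one Pearson correlation per land grid cell, keeping only the cells that pass the 0.2 cut-off. The sketch below shows that computation on made-up anomaly series. It assumes the cut-off applies to the absolute value of the correlation (the color scale covers both signs, but the slide does not say so explicitly); the cell coordinates, values, and function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # convention chosen here: a constant series is uncorrelated
    return cov / (sx * sy)

def correlated_cells(index, land_series, min_corr=0.2):
    """Return {cell: r} for land cells with |r| > min_corr.

    Thresholding on the absolute value is an assumption of this sketch.
    """
    result = {}
    for cell, series in land_series.items():
        r = pearson(index, series)
        if abs(r) > min_corr:
            result[cell] = r
    return result

# Made-up anomaly values for a climate index and three land grid cells,
# keyed by (latitude, longitude).
nino_index = [0.1, 0.8, 1.5, 0.3, -0.6, -1.0, 0.2, 1.1]
land_temp = {
    (-5.0, -75.0): [0.0, 0.5, 1.2, 0.4, -0.3, -0.9, 0.1, 0.8],      # tracks the index
    (45.0, -93.0): [0.3, -0.2, 0.1, 0.0, 0.2, -0.1, 0.0, 0.1],      # weak relation
    (-25.0, 135.0): [0.0, -0.6, -1.1, -0.2, 0.5, 0.9, -0.1, -0.9],  # opposite sign
}

for cell, r in correlated_cells(nino_index, land_temp).items():
    print(cell, round(r, 3))
```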

Page 14: University of Minnesota /AHPCRC

SST Clusters

[Figure: 107 SST Clusters, mapped by longitude and latitude]

Page 15: University of Minnesota /AHPCRC

SST Clusters That Reproduce El Nino Indices

El Nino Regions Defined by Earth Scientists

Niño Region   Range Longitude   Range Latitude
1+2 (94)      90°W-80°W         10°S-0°
3 (67)        150°W-90°W        5°S-5°N
3.4 (78)      170°W-120°W       5°S-5°N
4 (75)        160°E-150°W       5°S-5°N

[Figure: map by longitude and latitude showing SST clusters 75, 78, 67, and 94]

Cluster   Nino Index   Correlation
94        NINO 1+2     0.9225
67        NINO 3       0.9462
78        NINO 3.4     0.9196
75        NINO 4       0.9165
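A correlation table like the one above can be produced by reducing each cluster to a single time series and correlating it with each known index. The sketch below assumes the cluster is summarized by its centroid (the element-wise mean of its members), which the slides do not state explicitly; the index names, data values, and function names are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def centroid(members):
    """Element-wise mean of the time series in a cluster."""
    return [sum(vals) / len(vals) for vals in zip(*members)]

def best_matching_index(cluster_members, known_indices):
    """Return (name, r) of the known index most strongly correlated
    (in absolute value) with the cluster centroid."""
    c = centroid(cluster_members)
    scores = {name: pearson(c, series) for name, series in known_indices.items()}
    best = max(scores, key=lambda name: abs(scores[name]))
    return best, scores[best]

# Made-up data: two SST series in one cluster and two reference indices.
cluster = [
    [1.1, 2.0, 2.9, 4.2, 5.0],
    [0.9, 2.1, 3.1, 3.8, 5.1],
]
known = {
    "index_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "index_b": [5.0, 1.0, 4.0, 2.0, 3.0],
}

name, r = best_matching_index(cluster, known)
print(name, round(r, 4))
```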

Page 16: University of Minnesota /AHPCRC

An SST Cluster Moderately Correlated to Known Indices

[Figure: Cluster 29 vs. Known El Nino Climate Indices: Nino 1+2, Nino 3, Nino 3.4, Nino 4, and SOI (minimum correlation = 0.2); map by longitude and latitude, color scale shows correlation difference from -0.6 to 0.6]

Page 17: University of Minnesota /AHPCRC

Overview

Background

Data Mining Tasks
  – Detection of Ecological Disturbances
  – Discovery of Climate Indices

Distributed Issues

Conclusion

Page 18: University of Minnesota /AHPCRC

Need for High Performance Computing

SNN (shared nearest neighbor) clustering analysis requires O(n²) comparisons.

Association rule algorithms can also be very compute intensive.
  – Potentially very much greater than O(n²)

The amount of memory required for clustering and association rule algorithms can exceed the 4 GB of traditional sequential servers.

Computing the pairwise correlation between every land pixel and every ocean pixel is very time consuming on a sequential computer.
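A back-of-the-envelope calculation shows how quickly quadratic cost runs into the 4 GB limit mentioned above. The sketch assumes a dense n × n similarity matrix stored as 4-byte floats; the point counts are illustrative, and a real implementation might store only a sparse neighbor list instead.

```python
def pairwise_cost(n, bytes_per_value=4):
    """Return (number of unordered pairs, bytes for a dense n x n matrix)."""
    pairs = n * (n - 1) // 2
    matrix_bytes = n * n * bytes_per_value
    return pairs, matrix_bytes

GIB = 2 ** 30

for n in (10_000, 100_000, 1_000_000):
    pairs, size = pairwise_cost(n)
    print(f"n={n:>9,}: {pairs:,} pairs, dense matrix ~ {size / GIB:,.1f} GiB")
```

Under these assumptions a dense matrix passes 4 GB at roughly 32,000 points, far fewer than the number of cells in a global grid.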

Page 19: University of Minnesota /AHPCRC

High Resolution EOS Data

EOS satellites provide high resolution measurements
  – Finer spatial grids
      8 km × 8 km grid produces 10,848,672 data points
      1 km × 1 km grid produces 694,315,008 data points
  – More frequent measurements
  – Multiple instruments
      Generates terabytes of data per day

High resolution data allows us to answer more detailed questions:
  – Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
  – Finding relationships between leaf area index (LAI) and topography of a river drainage basin
  – Finding relationships between fire frequency and elevation as well as topographic position

Earth Observing System (e.g., Terra and Aqua satellites)

http://www.crh.noaa.gov/lmk/soo/docu/basicwx.htm

Page 20: University of Minnesota /AHPCRC

Distributed System For Analyzing Earth Science Data

[Figure: system architecture diagram. Components: Distributed Data Sources (DAAC, ESIP, Climate Data, Model Data, ...); Local Data; Data Acquisition, Fusion and Transformation; Data Exploration and Pre-Processing (Visualization, Statistical Analysis, Subsetting, Event Detection, Trajectory Analysis); Data Mining (Classification and Regression, Deviation Detection, Association Analysis, Clustering); Output and Presentation]

Page 21: University of Minnesota /AHPCRC

Data Acquisition, Fusion, and Transformation

[Figure: system architecture diagram, repeated from Page 20]

This module will provide the ability to acquire the data necessary for the analysis, and since the data comes from heterogeneous sources, the ability to fuse and transform the data.

Page 22: University of Minnesota /AHPCRC

Data Acquisition, Transformation, and Fusion

Challenges
  – Locate and download data when it becomes available
      Web services such as directory services and peer-to-peer networking capabilities for file sharing
  – Data fusion
      Conversion between different formats: HDF, HDF-EOS, netCDF, binary, ASCII, Earth Science Markup Language (ESML), Geography Markup Language (GML)
  – Data transformation
      Data transformation such as scaling, radiometric conversion, sampling in time, aggregation, and mathematical or geometrical map transformations to convert the incoming data to the same coordinate system
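Two of the transformations listed above, scaling and aggregation, can be sketched in a few lines. The example assumes the incoming data are stored as integers with a linear scale and offset plus a fill value for missing cells, and that aggregation means averaging square blocks of cells. The parameter names `scale_factor`, `add_offset`, and `fill_value` follow common netCDF attribute naming but are assumptions about the data, and both helper functions are hypothetical.

```python
def apply_scale(raw, scale_factor, add_offset, fill_value=None):
    """Convert stored values to physical units: v * scale_factor + add_offset.

    Cells equal to fill_value are treated as missing and become None.
    """
    return [
        [None if v == fill_value else v * scale_factor + add_offset for v in row]
        for row in raw
    ]

def block_average(grid, factor):
    """Aggregate a 2-D grid to a coarser one by averaging factor x factor
    blocks, ignoring missing (None) cells."""
    rows, cols = len(grid), len(grid[0])
    coarse = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [
                grid[i][j]
                for i in range(r, min(r + factor, rows))
                for j in range(c, min(c + factor, cols))
                if grid[i][j] is not None
            ]
            out_row.append(sum(block) / len(block) if block else None)
        coarse.append(out_row)
    return coarse

# Made-up 4 x 4 grid of stored integers; -999 marks a missing cell.
raw = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [-999, 0, 50, 50],
    [0, 0, 50, 50],
]
scaled = apply_scale(raw, scale_factor=0.5, add_offset=0.0, fill_value=-999)
print(block_average(scaled, 2))
```

Map reprojection and radiometric conversion are considerably more involved and are not covered by this sketch.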

Page 23: University of Minnesota /AHPCRC

Data Exploration and Pre-processing

[Figure: system architecture diagram, repeated from Page 20]

This module consists of several visualization, statistical, and time series preprocessing tools for supporting the exploratory analysis of large-scale Earth Science datasets. Such tools can be used to aid scientists in gaining an initial insight into the distribution, regularity, and quality of the input data.

Page 24: University of Minnesota /AHPCRC

Data Mining

[Figure: system architecture diagram, repeated from Page 20]

Data mining technology offers a suite of advanced decision support tools to facilitate the automatic generation of scientific hypotheses from data.

Page 25: University of Minnesota /AHPCRC

Data Mining and Exploration

Challenges
  – Complex data distribution
      Data may be split according to time periods, region, attribute, etc.
      Example: To obtain different types of data for a given point on the Earth it is often necessary to pull data from many sources.
  – Distributed computation
      For resource or other reasons, data mining tasks may be better executed by distributing the computation across resources in multiple organizations.
      Example: Finding specific events of interest; summarizing data
  – Time vs. accuracy tradeoff
      User should be able to make choices
      Example: Time series similarity using correlation vs. time series similarity computed using dynamic time warping.
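The time vs. accuracy example above can be made concrete by comparing the two similarity measures on a pair of series that differ only by a time shift. The sketch below uses the textbook dynamic time warping recurrence with absolute difference as the local cost and no window constraint; those choices, the function names, and the sample data are assumptions for illustration.

```python
from math import sqrt, inf

def pearson(x, y):
    """Pearson correlation: linear time in the series length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def dtw_distance(x, y):
    """Dynamic time warping distance: O(len(x) * len(y)) time.

    cost[i][j] holds the cheapest alignment of x[:i] with y[:j].
    """
    n, m = len(x), len(y)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat y[j-1]
                                 cost[i][j - 1],      # repeat x[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

a = [0, 0, 1, 2, 1, 0, 0, 0]
b = [0, 0, 0, 1, 2, 1, 0, 0]  # same shape as a, shifted by one step

print("correlation:", pearson(a, b))
print("DTW distance:", dtw_distance(a, b))
```

For this pair the correlation is only 0.5, while the DTW distance is 0 because warping absorbs the shift. The price is quadratic rather than linear time per pair, which matters when every pair of grid cells must be compared.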

Page 26: University of Minnesota /AHPCRC

Output and Presentation

[Figure: system architecture diagram, repeated from Page 20]

Output and presentation tools will convert results to common representations (e.g., MIME, ESML/GML/XML, JPEG, binary). Results can be posted to the Web and discussed in a collaborative fashion or easily incorporated into more traditional publications.

Page 27: University of Minnesota /AHPCRC

Output and Presentation

Challenges
  – Allow others to locate and download results when they become available
      Web services such as directory services and peer-to-peer networking capabilities for file sharing
      Describing the data via XML, Earth Science Markup Language, etc.
      Data transformation
  – Privacy and policy constraints
      Different levels of access needed

Page 28: University of Minnesota /AHPCRC

Usage Scenario

Download software from a web site and install it.

Launch the application and collect data from well-known sources on the Web or from local sources.

The system selects the appropriate format transformation and data fusion steps to convert all the data into a single co-registered format.
  – Use the metadata associated with the retrieved data to understand the resolution, spatio-temporal framework, attributes
  – User input

Use analysis tools and subsets of the data to perform pre-processing, data exploration, data mining, and post-processing.

Select results to be published on the Web, allowing collaboration and access via the Internet.

Page 29: University of Minnesota /AHPCRC

Conclusions

Disturbance and association analysis can uncover interesting patterns for Earth Scientists to investigate.

By using clustering we have made some progress towards automatically finding climate patterns that display interesting connections between the ocean and the land.

Many more opportunities for data mining/data analysis in Earth Science data.

Many opportunities for distributed computing to play a useful or critical role.

Page 30: University of Minnesota /AHPCRC

Questions?

More information can be found at

http://www.ahpcrc.umn.edu/nasa-umn/index.html