2nd

NASA Data Mining Workshop:

Issues and Applications in Earth Science

May 23rd – 24th, 2006

Pasadena, California

Final Report

Table of Contents

Executive Summary

Key Findings

Recommendations

1. Overview of the Workshop

a. Objectives

b. Attendance

c. Agenda

2. Analysis of Results

a. Current and Emerging Technology Themes

b. Connection to NASA’s Earth Science Agenda

c. Infusing Statistics and Data Mining into Earth Science Research

3. Recommendations

Appendix 1: Call for Papers

Appendix 2: List of Attendees

Appendix 3: Final Agenda

Appendix 4: Summary of Workshop Presentations

Appendix 5: Current and Emerging Technology Themes

Executive Summary

On May 23-24, 2006, NASA’s Earth Science Division sponsored the Second NASA

Data Mining Workshop: Issues and Applications in Earth Science Data, held in

conjunction with the Interface 2006 Symposium, at the Pasadena Westin Hotel. The

workshop, which was organized by a team from NASA and the University of Alabama in

Huntsville (UAH), was a successor to a previous workshop held in Huntsville in October

1999. The objectives of this second workshop were to again bring together researchers

from the Earth science, data mining, and statistics communities to see what results had

been achieved in the intervening 6 years, as well as to identify areas where data mining

and statistics could potentially yield significant scientific advances in Earth science.

The workshop consisted of an opening session of introductory talks from NASA and

Interface participants, five sessions of invited talks, and poster presentations. These were

selected from responses to a call for papers and organized by science topic. The

workshop concluded with a panel session on “How to promote the infusion of data

mining and statistical technologies into Earth science”. Sessions included discussion

time following presentations so participants could interact with the speakers and with

each other. A web page describing the workshop and containing all of the abstracts and

presentations can be found at: http://datamining.itsc.uah.edu/meeting06/index.html.

Key Findings

• The data mining and statistical methods presented at this workshop are considerably

more mature than they were 6 years ago. The presentations provided important

insights in a number of areas, with many of the techniques showing potential for

significantly advancing scientific understanding in various areas of Earth science.

• Data analysis approaches that have been historically employed in Earth science are no

longer adequate for dealing with the complexity, size, and novelty of NASA’s 21st

century data resources. New statistical methodologies and data mining algorithms that

address these issues need to be developed and infused into mainstream Earth science

research.

• The chief obstacles to infusion of modern data analysis methods in Earth science are:

1) the lack of publication venues and funding opportunities that promote innovative

data analysis in Earth science research, 2) the disconnect between “modeling the

data” and relating it back to underlying physical processes, and 3) the hesitancy of

Earth scientists to adopt new data analysis methods that have not been fully vetted

and accepted within their community.

• A conceptual framework is needed to articulate the roles that statistics and data

mining can play in advancing Earth science research. Such a framework should: 1)

link questions about Earth system processes to questions about data, and 2) provide

an infrastructure for making inferences from the data back to the underlying state of

the Earth’s system, and translating those inferences into physically meaningful

conclusions. Section 2 of this report outlines one possible framework, based on the

uses of data in NASA’s Earth science research program.

• Promoting collaboration between the existing Earth science, statistics, and data

mining communities is useful, but not sufficient, due to the intellectual and cultural

barriers. Establishing a new professional community, composed of researchers who

work in the intersection of Earth science, statistics, and data mining, may yield

greater impact over the long term.

Recommendations

• NASA Program Managers in both the Earth science and technology development

areas should work together to meld modern data analysis research into mainstream

Earth science research by: 1) adding criteria to proposal opportunities that require or

reward development and/or use of modern data analysis methods in Earth science

research, and 2) establishing a funding mechanism specifically for the development of

new statistical and data mining methodologies that respond to data analysis problems

arising from the use of massive observational data sets to answer key Earth science

questions.

• NASA should take the lead in establishing a new professional community dedicated

to scientific discovery through the development and use of modern statistical and data

mining methods. This must go beyond collaboration to foster a new generation of

researchers with training in both Earth science and data analysis. Through its

education programs, NASA should encourage this type of interdisciplinary training at

both graduate and undergraduate levels. Today’s students are tomorrow’s members

of the new community.

• NASA should form a new working group to be made up of community leaders who

work in the intersection of Earth science and statistics/data mining. This Earth

Science Data Exploration and Analysis Working Group (ESDEAWG) would:

- identify areas where current data analysis practices could be improved, either

through development of new techniques or infusion of existing ones that are

hitherto unexploited (or underexploited);

- develop a Technology Readiness Level (TRL) ladder that is appropriate for

measuring progress and maturity of data analysis methodologies;

- provide recommendations to NASA on fostering the interdisciplinary

professional community described;

- establish a set of standard hybrid statistical-physical process models (see

Section 2 of this report) that can be used to calibrate research results based on

different methodologies against one another;

- identify or create a set of benchmark datasets the community can use to test

and compare different methodologies on the same data;

- formulate a framework that articulates the roles of statistics and data mining

as a means to advance Earth science in NASA’s approach to scientific

discovery;

- encourage established geoscience journals to devote special issues to new

methods for data analysis [1].

• NASA should hold workshops in Earth science, data mining, and statistics on a more

frequent basis. NASA should also sponsor a set of focused tutorials designed to train

Earth scientists in modern statistics and data mining and to train statisticians and data

miners in Earth science. These could be held in conjunction with existing professional

meetings such as AGU or AMS, or could be in the form of a Gordon Research

Conference (see http://www.grc.uri.edu/).

[1] In the biological sciences, there are dedicated journals, such as Nature Methods, that fulfill this need.

1. Overview of the Workshop

The Second NASA Data Mining Workshop: Issues and Applications in Earth

Science Data was held on May 23-24, 2006 in conjunction with the Interface 2006

Symposium at the Pasadena Westin Hotel. The workshop was organized by a team from

NASA and the University of Alabama in Huntsville (UAH), and was the successor to

the First NASA Data Mining Workshop held in Huntsville in October 1999. Members of

the organizing committee are listed below. These individuals conceived and produced

this workshop thanks to sponsorship by NASA’s Earth Science Division. The Program

Committee reviewed and selected the papers for presentation. The organizers, Program

Committee members, and additional contributors (all listed below) helped shape and run

the workshop and write this report.

Organizers

Amy Braverman, NASA/Jet Propulsion Laboratory

Elaine Dobinson, NASA/Jet Propulsion Laboratory

Sara J. Graves, University of Alabama in Huntsville

Program Committee

Michael C. Burl, NASA/Jet Propulsion Laboratory

Becky Castano, NASA/Jet Propulsion Laboratory

Thomas Hinke, NASA/Ames Research Center

Christopher S. Lynnes, NASA/Goddard Space Flight Center

Bernard Minster, University of California San Diego

Rahul Ramachandran, University of Alabama in Huntsville

Additional Contributors

Jeanne Behnke, NASA/Goddard Space Flight Center

Lynne Carver, University of Alabama in Huntsville

Michael Garay, NASA/Jet Propulsion Laboratory

Stephanie Granger, NASA/Jet Propulsion Laboratory

Danny Hardin, University of Alabama in Huntsville

Brian Wilson, NASA/Jet Propulsion Laboratory

a. Objectives

Data from Earth-orbiting satellites have been accumulating at a very high rate for several

years now. In combination with in-situ observations and physics-based simulations, this

enormous, distributed repository holds answers to important questions about our planet’s

past, present and future. However, the information is only accessible if effective analysis

capabilities can be brought to bear. Data mining and statistics have the potential to

provide these capabilities, and, if employed in close coordination with Earth science

research, will significantly increase the science return from NASA’s vast Earth science

data collection.

The objectives of the Second NASA Data Mining Workshop were to 1) bring together

Earth scientists, statisticians, and data miners to match the needs of the scientific

community to existing capabilities provided by these data analysis experts, and 2) suggest

future research directions for data analysts to pursue to help advance Earth science

research. In particular, the workshop sought to facilitate formation of collaborative

relationships between Earth scientists and data analysts, and to identify specific problems

that these collaborations can address. To this end, NASA issued an open call for papers [2],

which is included in this report as Appendix 1.

b. Attendance

The workshop was attended by approximately 50 people, including NASA program and

project managers, presenters from the statistics and data mining communities, Earth

scientists, and members of the organizing and program committees. The workshop size

was consistent with the organizing and program committees’ goal of having a diverse but

productive workshop. The full list of attendees and their affiliations can be found in

Appendix 2.

c. Agenda

The workshop agenda consisted of an opening session of introductory talks from NASA

and Interface participants followed by five sessions of invited talks and a poster session.

The workshop concluded with a panel session on “How to promote the infusion of data

mining and statistical technologies into Earth science”. The complete workshop agenda

is listed in Appendix 3.

Opening the first day, Dr. Francis Lindsay from NASA’s Earth Science Division

discussed NASA’s data mining and analysis objectives and their relation to the

workshop’s objectives. He discussed NASA funding opportunities relevant to data

mining and statistical analysis and highlighted the contrast between core and community

data system development at NASA. He charged the participants with helping to narrow

the gap between information technology and Earth science and suggesting strategies

for moving these techniques into pertinent Earth science communities and data

systems. Next, Dr. Mary Ann Esfandiari, Program Manager for EOSDIS, spoke about

NASA’s plans for evolving its data systems. She presented a data system vision for the

year 2015 that provides increased interoperability, and greater flexibility and support for

users. Concluding the opening session, Dr. Ed Wegman of George Mason University

gave a talk, entitled “Statistics, Data Mining, and Climate Change”, which showed the

danger of using improper data analyses to reach scientific conclusions.

The remaining sessions consisted of oral and poster presentations from the collection of

papers submitted in response to the call. A total of 37 papers were submitted to the

workshop. Of these, 16 were selected for oral presentations, and 15 were chosen for

posters. The posters were presented the first evening during an informal reception. The

oral presentations covered a wide spectrum of scientific data mining and statistical

applications relevant to NASA’s Earth science mission. They also covered a variety of

different techniques and approaches. Appendix 4 contains a complete summary and

discussion of the workshop presentations.

[2] The call was posted in standard advertising venues for the computer science/data mining communities

(e.g., KDNuggets, ACM Calendar of Events) and for the Earth science community (AGU Meeting and

EOS). In addition, the call was directly sent to over 300 individuals with interest in scientific data mining.

The following sections of this report are devoted to more thorough analyses of aspects of

the workshop program, and a set of recommendations resulting therefrom.

2. Analysis of Results

In this analysis we examine the intellectual content of the workshop in order to formulate

a coherent picture of the state of the practice, and to suggest avenues for advancing

contributions of statistics and data mining to NASA’s Earth science research objectives.

a. Current and Emerging Technology Themes

As shown in Appendices 3 and 4, the workshop sessions were organized around science

discipline, in alignment with NASA’s Earth Science program. However, in the following

discussion and more completely in Appendix 5, we regroup (or reinterpret) the workshop

content according to technology theme, which leads to valuable insight about the current

state-of-the-practice and suggests possible directions for future work. During the course

of the workshop, it became clear that: 1) within a given science discipline, a broad variety

of data mining techniques could be applied, and 2) across different science disciplines,

the same data mining technique might be used [3]. Table 5.1 in Appendix 5 categorizes the

workshop papers by both science focus area and technology theme. This categorization

illustrates several points pertaining to the current state-of-the-practice and future themes.

Current State-of-the-Practice

A significant number of papers involve infrastructure activities and end-user tools that are

not tied to any specific science focus area. Infrastructure activities include, for example,

data management methods for organizing, storing, querying, and transferring large

volumes of data, creating and exploiting ontologies and metadata, and methods for

parallel and distributed data mining. End-user tools allow for retrieval, visualization, and

other interactive operations with large datasets. Although the initial reaction may be that

technologies should be more tightly coupled to science focus areas, it must be

acknowledged that there is indeed much in common across focus areas (especially at this

low level of data manipulation and processing); hence, factoring the commonalities into

infrastructure that can support multiple focus areas is reasonable. Researchers pursuing

such efforts are cautioned, however, that connecting their efforts back to the science

datasets and to the needs of the scientists themselves is critical to ensure useful products

emerge. A tool without a user or use is likely a wasted effort.

A second observation is that the dominant application of data mining and statistics

to Earth science data involves the use of supervised learning techniques for land cover

classification (under the Carbon & Ecosystems focus area), with such applications now

becoming fairly mature. In fact, very similar methods have been applied to cloud

classification and incorporated into the Langley DAAC for MISR processing.

[3] As a concrete example of the same technique being used across science disciplines, support vector

machines (SVMs) were used both for land cover classification and atmospheric cloud classification.

Another popular application for data mining and statistics involves the use of clustering

techniques for spatio-temporal pattern identification, usually at a global or regional scale,

with most of the activity in the Climate and Carbon/Ecosystems focus areas. This activity

likely reflects the broad interest in climate change and its impact on and interactions with

ecosystems, as well as the suitability of EOS data for doing these types of studies.

A more complete analysis of these and other observations is given in Appendix 5, along

with the table.

Future Themes

Given the current state-of-the-practice, can we identify current theoretical developments

or problem areas that are likely to have significant impact five years down the road?

Clearly, there are many areas in Table 5.1 that are empty or sparsely populated. These

cells indicate opportunities where data mining and statistical methods can potentially

create breakthroughs. Some of these we have already noted, but one of the more

promising areas for future work is the incorporation of physically-inspired process

models into data mining endeavors. In Table 5.1 we see that there were very few papers

at the workshop that attempted to incorporate domain knowledge directly into the process

model. Many of the scientists at the workshop lamented that the data mining and

statistical approaches may do a good job of modeling the data, but do not provide insight

into why they are doing a good job or a connection back to underlying physical processes.

By combining a physically-inspired process model with an observational (data) model

and characterizing the uncertainties within these two models, we may better be able to

make inferences about the underlying physical processes. Section 2b below expands upon

this idea, providing a more detailed conceptual model and discussion.

Generalizing the Role of Data Mining Methods in Earth Science Research

The roles these technologies play in Earth science research as a whole depend on the

particular method and application, but can be broadly categorized into two classes: 1)

Data Characterization and Feature Detection, and 2) Causal Analysis and Anomaly

Discovery. Data characterization and feature detection includes such technologies as

classification techniques, kriging and uncertainty analysis, and clustering and statistical

summarizations. Typically, the primary focus is on providing a more understandable

characterization (or view) of the underlying structure of large amounts of science data.

While they are not generally targeted toward directly extracting scientific results, these

methods can make massive data sets comprehensible and thus tractable to further

scientific analysis. An important aspect of these techniques is that the problem to be

solved is generally well-defined. Thus, the applicability of a given technique can be

determined, so that the risk is relatively low.

On the other hand, causal analysis and anomaly discovery are characterized by the

discovery of novel relationships among variables. Some examples include the inference

of predictive models and the discovery of unexpected phenomena. An example is the

search for novel climatic indices and related teleconnections. This role is less common in

the scientific data mining world. This is not surprising since the novelty aspect makes it

difficult to ensure beforehand (say, at grant application time) that a useful result can be

achieved. However, it is that very novelty that also makes for a potentially high reward.

As such, it represents an important niche for the systematic application of certain data

mining techniques as an alternative to the (often) serendipitous nature of human

discovery. This does not cede the role of scientific discovery to data mining, since such

techniques typically do not provide a physical explanation or model. Rather, data

mining should proceed hand in hand with methods more grounded in the natural

sciences, the first identifying novel aspects for study, the second fitting them into an

understandable scientific model or framework.

b. Connection to NASA’s Earth Science Agenda

The workshop discussions repeatedly stressed that success in achieving relevance in

NASA’s mainstream scientific endeavors depends upon the central and continuous

involvement of the Earth science community, and this success is unlikely without a more

cohesive, focused data analysis community dedicated to solving Earth science problems.

Themes that resonated and recurred throughout the workshop included the relationship

between data analysis (a term used here to encompass both statistics and data mining) and

science understanding, the need for infusion beyond simple collaboration, and the need

for community. The remainder of this section is devoted to discussing the problems of

using data to advance scientific knowledge, and to discussing the issues of collaboration

and community in detail.

As noted previously, many interesting and successful projects were presented at this

Second Workshop. The data mining and statistics communities have clearly made

significant headway toward developing methods to address NASA’s Earth science

technology needs since the First Workshop in 1999. However, still more can be done to

infuse these methods into mainstream Earth science, and to develop new methods in

response to new problems. To accomplish this, the role of data in NASA’s approach to

Earth science needs to be examined, and the uses of data mining and statistics need to

be focused more clearly on turning those data into science understanding.

Understanding Earth Science Data

A primary focus of this workshop was to determine how data mining and statistical

analysis could advance scientific understanding of the Earth’s system. NASA’s Earth

Observing System (EOS) datasets provide massive quantities of observational data and

are an enormous resource for scientists working to understand physical processes that

make up that system. However, both the volume and complexity of the data sets are

impediments to their full exploitation. The workshop presented many methods and

approaches to address these issues, but as we listened to the varied problems and

solutions it was difficult to understand the relationships among them. A fundamental

conclusion is that a conceptual model or framework is needed to organize these

methods, one that is rigorous enough to provide structure and guidance, but flexible

enough to accommodate the wide variety of techniques and issues.

How, then, can we construct such a flexible but rigorous framework within which to link

broad Earth science questions to statistical analysis and data mining methods that may

help answer them? To bridge this gap two things are required. First, science questions,

which are questions about Earth system processes, must be translated into

questions about data. Second, answers to data questions must be linked back to

underlying physical processes. If these two requirements can be made quantitative, they

will provide the required structure.

To examine the role and uses of data to solve key Earth science questions, we use the

NASA “Climate Variability and Change” science focus area as an example of a NASA-

defined science objective. We present a very general conceptual model that provides a

mechanism for inference, and helps organize the problems, techniques and applications

presented at the workshop in order to better target our community’s efforts to further

NASA’s science goals.

NASA’s Approach to Science

The Climate Variability and Change Roadmap (Figure 1) poses the following questions:

• How is the global ocean circulation varying on interannual, decadal, and longer

time scales?

• What changes are occurring in the mass of the Earth's ice cover?

• How can climate variations induce changes in the global ocean circulation?

• How is global sea level affected by natural variability and human-induced change

in the Earth system?

• How can predictions of climate variability and change be improved?

According to the Roadmap’s “Where we plan to be” box, NASA seeks to characterize

and reduce the uncertainties in long-term climate prediction and provide routine

probabilistic forecasts of key climate variables. This speaks to the last question above:

how can predictions of climate variability and change be improved? The other four

questions appear to be aimed at better understanding oceanic and cryospheric component

processes that both influence and are influenced by the general climate system. Other

roadmaps identify similar types of questions related to other Earth system components.

This organization provides a general picture of how NASA’s scientific community

approaches its work to answer key questions: through a continuum of research and

modeling efforts to understand physical processes ranging from very specific, highly

constrained local and regional studies, to coupled global climate models which link

process models together and simulate feedbacks. The community has achieved an

impressive compromise between decentralized and centralized efforts, the latter led by

major modeling centers such as the National Center for Atmospheric Research (NCAR)

and the Geophysical Fluid Dynamics Laboratory (GFDL).

Figure 1. Climate Change and Variability Roadmap.

The Role of Data

Data contribute to this process in several ways. First, exploratory data analysis (EDA)

elucidates and/or discovers variables and relationships that help advance understanding of

physical processes. To the extent that physical models represent the community’s best

understanding of these processes, data contribute to model improvement. We call this

role “hypothesis formulation and discovery”. Second, data make hypothesis testing

possible. Formal hypothesis tests are part of confirmatory data analyses (CDA, i.e.,

inference) that assess the magnitudes of observed phenomena relative to their

uncertainties. Only when the former exceed the latter by a large enough margin do we

have a basis for concluding that phenomena are statistically significant. Otherwise,

observed phenomena might be artifacts of sampling. We call this role “hypothesis testing

and model diagnosis”. Here it is especially important to distinguish between hypotheses

about data and hypotheses about underlying physical processes. Data in hand provide a

particular, but not necessarily representative, view of underlying physical mechanisms.

Third, in coupled climate models, or even in complex physical process models, statistical

descriptions of data provide “parameterizations” of components that need to be

represented, but which are insufficiently understood to model deterministically. Here

again it is crucial that these statistical descriptors respect the underlying process and not

just the observed data.
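
The confirmatory step in the second role above, comparing the magnitude of an observed phenomenon with its uncertainty, can be sketched in a few lines of Python. This is a minimal illustration only: the function name `z_test_mean`, the sample values, and the 0.05 threshold are assumptions introduced here, and the test relies on a normal approximation with independent observations.

```python
import math
from statistics import NormalDist, mean, stdev


def z_test_mean(sample, null_value=0.0):
    """Two-sided z-test of the sample mean against null_value.

    Uses a normal approximation and assumes independent observations,
    so it is only a rough guide for small or correlated samples.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    effect = mean(sample) - null_value        # magnitude of the observed phenomenon
    std_error = stdev(sample) / math.sqrt(n)  # uncertainty of the sample mean
    if std_error == 0:
        raise ValueError("sample has no variation; the test is undefined")
    z = effect / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return effect, std_error, z, p_value


# Hypothetical anomalies (illustrative numbers, not workshop data)
anomalies = [0.4, 0.1, 0.5, 0.3, 0.2, 0.6, 0.3, 0.4]
effect, std_error, z, p_value = z_test_mean(anomalies)
significant = p_value < 0.05
```

A small p-value here is a statement about the data in hand. Whether it says anything about the underlying physical process depends on how representative those data are, which is the distinction drawn above.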

These three roles of data aid directly in achieving better understanding of the climate

system, and therefore will help improve long-term climate prediction by deterministic

models. A fourth role is data assimilation, which provides real-time optimal estimates of

important quantities by combining deterministic model predictions and observations in a

Bayesian statistical framework. Unlike the previous three applications, in which data are

used to improve deterministic models that then make predictions, data assimilation

ingests model predictions after the predictions are made, and updates them in view of the

“evidence” provided by observations. Assimilation produces evolving, best estimates of

the true underlying state of the system. We do not include this fourth use of data in

formulating our conceptual view because the goal of assimilation is not to produce better

deterministic models, but to produce evolving best estimates of process-level climate

variables in real time.
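
The Bayesian update at the core of data assimilation can be illustrated with a scalar sketch in Python. It assumes a single quantity, Gaussian and mutually independent forecast and observation errors, and known variances; operational assimilation systems are multivariate and far more elaborate. The function name `assimilate` and the numbers are hypothetical.

```python
def assimilate(forecast, forecast_var, observation, obs_var):
    """Combine a model forecast and an observation of the same quantity.

    Assumes both are Gaussian and independent, with known variances.
    Returns the updated (posterior) mean and variance.
    """
    if forecast_var <= 0 or obs_var <= 0:
        raise ValueError("variances must be positive")
    gain = forecast_var / (forecast_var + obs_var)  # weight given to the observation
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


# Illustrative numbers only: forecast 20.0 (variance 4.0), observation 22.0 (variance 1.0)
analysis, analysis_var = assimilate(20.0, 4.0, 22.0, 1.0)
```

The updated estimate lies between the forecast and the observation, closer to whichever has the smaller variance, and its variance is smaller than either input variance.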

A Framework for Data Analysis and Scientific Inference

In this section we offer a simple example of a conceptual model that relates the data

observed to the underlying physical mechanisms that generate them. The model has two

parts. The first part, called a statistical process model, states that the actual value of the

quantity of interest is the sum of a base value and local variation. The second part, called

the data model, states that the observed value of the quantity of interest is the sum of the

actual value and measurement error.

A statistical process model is a probability model that describes the behavior of quantities

with probability distributions. For example, “Actual total column water vapor in

Bakersfield, California in December is the sum of base value $\mu$ and a Gaussian local

perturbation with zero mean and variance $\sigma^2$” is an example of a very simple statistical

process model. Unlike a deterministic model, which would describe total column water

vapor by physical equations, this statistical model views the data generating mechanism

probabilistically.

The data model describes the relationship between the actual value of the quantity of

interest and observations of it. For example, “Measured total column water vapor in

Bakersfield, California in December is subject to error, where that error follows a

Gaussian distribution with mean $\nu$ and variance $\tau^2$” is a simple data model. Combining

the statistical process and data models provides a route to inference: to produce an

estimate of the base value, one uses the mean of the observations. To quantify uncertainty

in that estimate, one must account for both the measurement uncertainty ($\tau^2$) and the

variation of the actual values ($\sigma^2$).

This conceptual model is very general, and to be of practical use it must be tailored to the

specific setting of the problem in which it is to be used. For instance, in the Bakersfield

example, one may ask whether: a) the Gaussian distribution is the right distribution; b)

the values around Bakersfield are statistically independent of one another as is implied by

treating all local perturbations as if they arose from a single Gaussian; c) the spatial scale

attributed to the base value is appropriate to the physical phenomenon being studied; d)

the spatial correlations embodied in local perturbations about the base value are

appropriate; e) measurement error is independent of the base value; and, very importantly,

f) the base values themselves can be described with a physical model. If f) holds, the

statistical process model becomes a hybrid, statistical-deterministic model and is a

mechanism for explicitly injecting scientific knowledge into the analysis.
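
As an illustration only, the following Python sketch simulates the Bakersfield example in its simplest, untailored form and estimates the base value. It is not drawn from the workshop material. It assumes independent Gaussian perturbations, zero-mean measurement error, and known variances; the names sigma2 (variance of the local perturbation), tau2 (variance of the measurement error), and both function names are hypothetical choices made for this example.

```python
import math
import random
import statistics


def simulate_observations(mu, sigma2, tau2, n, seed=0):
    """Draw n observations from the two-part model (illustrative assumptions).

    Process model: actual value   = mu + perturbation,  perturbation ~ N(0, sigma2)
    Data model:    observed value = actual value + error, error ~ N(0, tau2)
    """
    rng = random.Random(seed)
    actual = [mu + rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(n)]
    return [y + rng.gauss(0.0, math.sqrt(tau2)) for y in actual]


def estimate_base_value(observed, sigma2, tau2):
    """Return (estimate, standard error) for the base value mu."""
    n = len(observed)
    estimate = statistics.fmean(observed)
    # Under independence, each observation varies about mu with variance
    # sigma2 + tau2, so the sample mean has variance (sigma2 + tau2) / n.
    std_error = math.sqrt((sigma2 + tau2) / n)
    return estimate, std_error


obs = simulate_observations(mu=12.0, sigma2=4.0, tau2=1.0, n=400)
mu_hat, se = estimate_base_value(obs, sigma2=4.0, tau2=1.0)
print(f"estimate = {mu_hat:.2f}, standard error = {se:.2f}")
```

If the independence assumption questioned in b) and d) fails, the standard error computed here would generally be too small, which is one reason the tailoring questions above matter.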

There will be a host of such questions specific to any given investigation, and the answers

need to be codified mathematically so that uncertainty and biases are properly handled.

Sometimes the goal will be to obtain estimates, along with their uncertainties, of

physically meaningful parameters, and at other times the goal may be to develop a good

hybrid statistical-deterministic model capable of making accurate predictions at places or

times where measurements are unavailable. Another goal may be to make optimal

predictions or estimates where multiple data sources provide conflicting information

about the same or related quantities. This is the data fusion problem. In all cases it is

essential to base conclusions on a model framework that is scientifically sound and

statistically rigorous. This two-part concept, made up of a statistical-deterministic

process model and a data model, provides a mathematical bridge between the data and

the underlying physical processes we seek to understand. We need to cross that bridge

in order to take our conclusions from the data we analyze to answers to the Earth science

questions.
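
The data fusion problem mentioned above can be illustrated with inverse-variance weighting, a simple textbook approach that this report does not itself prescribe. The sketch assumes that each source gives an unbiased estimate of the same quantity, that the errors are independent, and that the error variances are known; the function name and the numbers are hypothetical.

```python
def fuse_estimates(estimates, variances):
    """Combine independent, unbiased estimates of one quantity.

    Each estimate is weighted by the inverse of its error variance, so the
    more precise source contributes more to the fused value.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * x for w, x in zip(weights, estimates)) / total
    fused_variance = 1.0 / total  # never larger than the smallest input variance
    return fused, fused_variance


# Two hypothetical sources that disagree about the same quantity.
fused, fused_var = fuse_estimates(estimates=[10.0, 14.0], variances=[1.0, 3.0])
print(fused, fused_var)
```

When the sources are biased or their errors are correlated, these weights are no longer optimal, and the process and data models must be extended to describe those effects.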

The Role of Data Mining and Statistical Analysis

Earlier we described three distinct roles of data in NASA’s approach to science: 1)

hypothesis formulation and discovery, 2) hypothesis testing and model diagnosis, and 3)

parameterization. The role of data mining and statistical analysis is to provide

machinery for employing data in each of these roles.

With respect to hypothesis formulation and discovery, NASA’s remote sensing data sets

are often so large that, like other massive data sets, their volume makes working with

them difficult. In fact, a fundamental problem, which many works presented at this

workshop addressed, is characterizing the data and relationships therein. For example,

supervised learning methods seek to classify or estimate the value of an unknown

quantity in the main body of a data set based on relationships observed in a portion of the

data. In other words, some data points are complete in that they contain observations of

both a quantity, say y, to be predicted and variables x that may be useful for making the

prediction. The set of available (x,y) pairs is called the training set. In a large portion of

the data, the quantity to be predicted (y) is missing. One viewpoint on supervised learning

is that it characterizes the joint distribution of x and y using the training data, and then

extrapolates that relationship to the remainder of the dataset in order to fill in the missing

information. The objective is to discover systematic relationships that were not known

previously, and understand their implications. However, all the uncertainty associated

with this procedure arises from uncertainty surrounding the representativeness of the

training data relative to the full dataset and the inductive bias of the particular learning

algorithm used. The procedure does not include a formal attempt to infer corresponding

systematic relationships in the process that generated the data.
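
The supervised learning setting described above can be sketched in a few lines of Python. This is a minimal example that assumes a single predictor and a straight-line relationship fitted by ordinary least squares; the data values and the function name are hypothetical, and workshop participants used a variety of other learning algorithms.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data set: y is observed for some records and missing (None) for others.
records = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, None), (5.0, None)]

# The complete (x, y) pairs form the training set.
training = [(x, y) for x, y in records if y is not None]
slope, intercept = fit_line([x for x, _ in training], [y for _, y in training])

# Extrapolate the learned relationship to fill in the missing y values.
filled = [(x, y if y is not None else slope * x + intercept) for x, y in records]
print(filled)
```

The filled-in values are only as trustworthy as the assumption that the training records are representative of the rest of the data, and the choice of a straight line is itself an inductive bias.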

This may be one reason why data mining techniques are not widely accepted in Earth

science: these methods are often directed only toward discovery, and not toward hypothesis

formulation or testing. Few data analysts follow through to formulate testable, physical

hypotheses. Following through means teaming with Earth scientists to understand why, in physical

terms, observed relationships occur. This is hypothesis formulation and brings home the

point that, for scientists, pure predictive capability is not enough: scientists want to

understand why. Mining applications that do not provide information relevant to

hypothesis formulation will not capture the attention of the Earth science community.

Once a physical hypothesis exists, the next step is to formulate a test of that hypothesis

relative to the underlying statistical or statistical-deterministic process model.

Constructing an appropriate process model that adequately captures physical

understanding and properly treats uncertainty requires both statistical and scientific

expertise. The hypothesis is a statement about the process model for which the

implications can be mathematically propagated through the data model. These

implications exhibit themselves as probabilistic statements about what we would expect

to observe in the data if the hypothesis were true. By comparing the actual data to that

which is expected, the test determines whether or not data and hypothesis are consistent

with one another, and attaches a probabilistic confidence statement to the conclusion. If

the process model, the data model, or the propagation is incorrect, the conclusions will

not be credible. Moreover, any statistical parameterizations used as placeholders for

missing physical specification must be relative to the underlying process, not just the

observed data.
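
A minimal sketch of such a test follows, written in Python. It assumes the simple two-part model with independent Gaussian perturbations, zero-mean measurement error, and known variances; the hypothesis is that the base value equals a stated value mu0. The names sigma2, tau2, mu0, the function name, and the observations are all hypothetical.

```python
import math


def z_test_base_value(observed, mu0, sigma2, tau2):
    """Two-sided z-test of the hypothesis that the base value equals mu0.

    Propagating the hypothesis through the process and data models implies
    that the sample mean is Gaussian with mean mu0 and variance
    (sigma2 + tau2) / n when the hypothesis is true.
    """
    n = len(observed)
    sample_mean = sum(observed) / n
    std_error = math.sqrt((sigma2 + tau2) / n)
    z = (sample_mean - mu0) / std_error
    # Two-sided p-value for a standard normal statistic.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


observed = [11.2, 13.5, 12.8, 10.9, 12.6, 13.0, 11.7, 12.3]
z, p = z_test_base_value(observed, mu0=12.0, sigma2=4.0, tau2=1.0)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```

A large p-value indicates that data and hypothesis are consistent with one another under the assumed models; it does not validate the models themselves.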

This discussion led to an important conclusion: to be more effective in Earth science,

data miners and statisticians must work together with physical scientists to complete

the chain of analysis through to inference about the Earth’s system. This will be a very

problem-specific task. The exact forms of the data, process and physical models involved

will be unique, and must be melded together carefully if uncertainties are to be

represented properly.

c. Infusing Statistics and Data Mining into Earth Science Research

At the 1999 NASA Data Mining Workshop participants voiced a concern that without the

development of practical data mining methods a great deal of Earth science data would be

underexploited, fundamentally undermining efforts to increase understanding of the Earth

system. Certainly in some measure, this concern has been realized. The Earth science

community still appears to view statistical and data mining applications in Earth science

as solutions in search of problems. Involvement of Earth scientists, while greater than in

1999, remains relatively low.

Teaming Earth scientists and data analysis professionals is fundamental to achieving

success in this modern era of massive Earth science datasets. These datasets are so large

and complex that traditional data analytic methods are unable to exploit them fully.

Moreover, important scientific questions these data can help answer may never even be

asked because the lenses provided by simple techniques traditionally used by Earth

scientists are not sufficiently discriminating. Therefore, it is essential that modern

statistical and data mining methods become part of mainstream Earth science.

Workshop attendees identified a number of cultural and technical factors contributing to

the isolation of Earth science and data analysis communities. The next sections of this

report examine those factors and suggest how they might be overcome.

Current Obstacles

Data mining and large-scale statistical applications for Earth science problems are, for the

most part, still in development. A fair proportion of results presented at the workshop

involved tests on small data sets as case studies. Only a few examples of mature

approaches were presented. Part of the reason for the slow development of data mining

and analysis methods since 1999 is the slow adoption and implementation of a data

access infrastructure able to handle and deliver the amount of data being acquired by

NASA’s Earth Science Enterprise. Even so, the current state of affairs cannot be blamed on

problems with data access alone. Rather, the chief obstacles identified by workshop

participants are a lack of clear scientific leadership, participation, and support from

the Earth science community, and the tendency of data miners and statisticians to want

to solve problems that are interesting to them rather than to Earth scientists.

Workshop participants identified a variety of reasons for this isolation. First, scientists as

a whole, and well-established senior scientists in particular, tend to be conservative when

it comes to adopting new approaches. To illustrate this point, Figure 2 shows the number

of publications listed in the INSPEC database containing the acronym “NDVI” –

Normalized Difference Vegetation Index, now a standard approach in remote sensing of

land surfaces. The first publication listed in the database appeared in 1985. The number

of publications remained well below ten for seven years, until 1992, when the number of

publications began to show a steady increase. Whether due to data access issues or

scientific skepticism, this shows how long it can take for a new idea to make its way into

mainstream science analysis.

A primary reason for slow adoption of new practices in Earth science is the prevailing

perception that, in order to obtain funding and have papers accepted for publication, it

is necessary to work with established techniques. New approaches are “risky” and are

consequently less likely to get funding. This presents a real challenge for the professional

data analysis community because domain scientists’ participation is crucial to establish

the validity of new data analytic methods in the first place. Even early-career domain

scientists, who tend to be somewhat more receptive to new ideas, are unlikely to devote

substantial amounts of time to data mining and statistical techniques that are outside their

realm of expertise, and which may eventually turn out to be ineffective.

Figure 2: NDVI Publications by Year from INSPEC Article Database (horizontal axis: year, 1985 to 2004; vertical axis: number of publications, 0 to 100)

Not only is the Earth science community generally risk-averse to new data analytic

methods, but also many scientists believe that they already know “enough” statistics and

are unlikely or unwilling to enlist the help of statistical experts when it comes to data

analysis problems. This is partly due to the limited statistical training that students in

Earth sciences receive as part of their graduate or undergraduate schooling. Although

basic statistical and data analytic techniques are introduced, new techniques do not

usually appear in the classroom, nor are students likely to acquaint themselves with the

statistical research literature because this is well outside their area of interest and

expertise. The result is that the majority of statistical analysis in the Earth science

literature rarely goes beyond simple, off-the-shelf methods. More advanced statistical

approaches are rare, and generally limited to research led by statisticians in the absence

of domain-relevant inspiration.

A different cultural problem impedes the infusion of mainstream data mining into Earth

science. The perception of many domain scientists is that methods such as supervised

classification and machine learning are simply “magic black boxes” that give what appear

to be correct results given certain inputs. This, however, subverts the usual paradigm for

how science is done, as shown in Figure 3 below. In this figure, the “traditional” Earth

science approach is shown by the shaded arrows. Earth science progresses by taking

observations, applying knowledge of physical principles to these observations, using this

knowledge to develop models, and allowing the models to make predictions. The data

mining approach also begins with the observations but feeds them through the

“magic black box,” which produces predictions. In either paradigm, the predictions can

feed back to the observations for comparison. However, unknown to the domain

scientists, and often the data analysis experts themselves, embedded within the “magic

black box” are potentially important clues to physical reasons that explain why data

mining produces good predictions. Data miners need to provide this information to

scientists in order to illuminate the relationships between good predictions and the

physics that underlie them. Unfortunately, very little research within the data analysis

community is currently going into uncovering these “hidden models”.

Figure 3

Another point raised by workshop participants was that science is hypothesis driven, but

data mining is data-driven. From the perspective of domain scientists, it often appears

that data mining is nothing more than a “fishing expedition”. In this view, an algorithmic

approach is developed by data analysis experts and applied “blindly” to Earth science

data, without scientific guidance, and in hopes of discovering some relevant scientific

result. There is a sense that the data analysis community has too much of a “have tool,

will travel” attitude toward Earth science data analysis and lacks appropriate direction

and focus.

This reveals a major intellectual gap that must be filled: the failure of all parties to

understand the roles of data mining and statistics in Earth science research. The role of

data mining in hypothesis formulation and discovery is distinct from the role of statistics in

hypothesis testing, model parameterization, and inference. There is an important role

for the so-called “fishing expedition” in understanding the content of a massive data set,

but only if it follows through to the hidden (physical) model, and formulation of testable

hypotheses. To be testable, hypotheses must be expressed in terms of the data, and related

to a model of the underlying data generating mechanism. Too often the data analysis

expert thinks his or her job is over after characterizing relationships in the data. In reality,

it has just begun. The next steps require Earth scientists to participate in finding the

hidden model and articulating a hypothesis, and statisticians to formulate and test that

hypothesis.

Bridging the Gap

The main obstacles discussed above are the risk-averse culture of Earth science, along

with the structural aspects of funding and publications mechanisms that reinforce it, and

the failure of professional data analysts to: a) provide methods that help explain

underlying physical processes rather than merely make predictions, and b) follow through

with hypothesis formulation, testing, and inference.

Overcoming these challenges requires a concerted effort to develop a new professional

community consisting of truly interdisciplinary researchers – people who are trained in

both Earth science, and statistics or data mining. That training need not be formal, at

least in the short term, but it does require a commitment to understanding the

fundamentals of statistical analysis, data mining, and Earth science.

The new community, which might be called Earth Science Analytics, would reward

development of new data analytic methods that respond to new scientific questions, data

types, and computational capabilities. High value would be placed on new science

discoveries made possible by new data analytic techniques. Funding mechanisms that

reward this would be crucial: NASA and NSF would have to respond by setting up

special programs. NSF already has the Collaborations in Mathematics and the

Geosciences program, but the new programs would have to go well beyond collaboration.

NASA funds Earth science and technology through different offices and perhaps this is

contributing to the isolation of the two communities. Workshop participants discussed the

pros and cons of starting a new journal for publication of their results, but concluded that

using existing journals, perhaps organizing special editions, would be more practical.

NASA is in a unique position to foster a professional community dedicated to bringing

modern statistical and data mining technologies into Earth system science. NASA

sponsors both Earth science and technology, and has a great interest in seeing return of

science understanding from its data. A key finding of this workshop is that NASA should

establish a new advisory group to create and promote the efforts of this community. Only

NASA can bridge this gap by organizing and motivating the constituent experts from

all areas to work together.

3. Recommendations

The following recommendations to NASA stem from the discussions above and

those at the workshop:

• NASA Program Managers in both the Earth science and technology development

areas should work together to meld modern data analysis research into mainstream

Earth science research by: 1) adding criteria to proposal opportunities that require or

reward development and/or use of modern data analysis methods in Earth science

research, and 2) establishing a funding mechanism specifically for the development of

new statistical and data mining methodologies that respond to data analysis problems

arising from the use of massive observational data sets to answer key Earth science

questions.

• NASA should take the lead in establishing a new professional community dedicated

to scientific discovery through the development and use of modern statistical and data

mining methods. This must go beyond collaboration to foster a new generation of

researchers cross-trained in both Earth science and data analysis. Through its

education programs, NASA should encourage both graduate and undergraduate

interdisciplinary training. Today’s students are tomorrow’s members of the new

community.

• NASA should form a new working group to be made up of community leaders who

work in the intersection of Earth science and statistics/data mining. This Earth

Science Data Exploration and Analysis Working Group (ESDEAWG) would:

- identify areas where current data analysis practices could be improved, either

through development of new techniques or infusion of existing ones that are

hitherto unexploited (or underexploited);

- develop a Technology Readiness Level (TRL) ladder that is appropriate for

measuring progress and maturity of data analysis methodologies;

- provide recommendations to NASA on fostering the interdisciplinary

professional community described;

- establish a set of standard hybrid statistical-physical process models (see

Section 2 of this report) that can be used to calibrate research results based on

different methodologies against one another;

- identify or create a set of benchmark data sets the community can use to test

and compare different methodologies on the same data;

- formulate a conceptual model that articulates the roles of statistics and data

mining as a means to advance Earth science in NASA’s approach to scientific

discovery;

- encourage established geoscience journals to devote special issues to new

methods for data analysis.

• NASA should hold workshops in Earth science data mining and statistics on a more

frequent basis. NASA should also sponsor a set of focused tutorials designed to cross-train

Earth scientists in modern statistics and data mining and to cross-train

statisticians and data miners in Earth science. These could be held in conjunction with

existing professional meetings such as AGU or AMS, or be in the form of a Gordon

Research Conference (see http://www.grc.uri.edu/).

Appendix 1: Call for Papers

Second NASA Data Mining Workshop:

Issues and Applications in Earth Science

May 23-24, 2006

Westin Hotel, Pasadena, CA

http://datamining.itsc.uah.edu/meeting06/index.html

Data from Earth-orbiting satellites have been accumulating at a very high rate for several

years now. In combination with in-situ observations and physical model output, this

enormous, distributed repository holds the answers to important questions about our

planet’s past, present and future. However, the information is accessible only if effective

analysis capabilities can be brought to bear. Data mining has the potential to provide

these capabilities, and, if employed in close coordination with Earth science research, can

increase the science return from NASA’s vast Earth science data collection.

Workshop objectives:

The objectives of this Second NASA Data Mining Workshop are to bring together Earth

scientists and data miners to match the needs of the scientific community to existing

capabilities provided by computer scientists and statisticians, and suggest future research

directions they may pursue to help advance Earth science research. In particular, we seek

to facilitate formation of collaborative relationships between Earth and data scientists,

and identify specific problems those collaborations can address. To those ends, we will:

1. Assess the progress that has been made in Earth science data mining and analysis

since the first NASA Data Mining Workshop held in 1999 (see

http://datamining.itsc.uah.edu/meeting/); and

2. Identify areas where data mining could potentially yield significant scientific

advances in Earth science in the near and medium term.

Call for papers:

In order to facilitate an effective exchange of ideas and meaningful discussions, the

number of participants will be limited to approximately 40 selected submissions. It is

important that all participants commit to being in residence for the full duration of the

workshop.

The workshop format will be a combination of oral presentations, posters, breakout

sessions, and open plenary discussions. The agenda will be organized around papers

submitted in response to the following breakdown of the two high-level workshop

objectives listed above.

1.1 Description of successful projects. We seek papers that describe the nature of

the data mining techniques used, and how they contributed to the scientific results of the

project. We are also interested in particular characteristics of the collaborative interaction

that contributed to the project’s success, or could have been improved.

1.2 New projects. Descriptions of projects that are just getting started, but have a

significant data mining component that the authors believe will further the scientific

objectives. Papers should describe the data mining techniques used, how the authors

anticipate these techniques will contribute to the project’s scientific goals, and how the

project is organized to facilitate interaction between Earth scientists and data miners.

2.1 Unsolved scientific problems. Descriptions of difficult scientific problems that

have not been successfully addressed, but which the authors feel could potentially be

addressed by data mining methods. Papers should describe techniques that have

previously been applied, and why they have not been adequate. Authors should also

provide some evidence or justification for appropriateness of data mining as a solution,

and discuss requirements or constraints that would apply.

2.2 New applications for proven data mining techniques. Descriptions of data

mining techniques that have been successfully used in areas outside of Earth science, and

which the authors believe would be useful to Earth scientists. Papers could also include

techniques that have been used for one area of Earth science research which the authors

believe would be applicable to other areas. Authors should address why these techniques

have not been previously applied in the proposed area, and what the impediments are to

their near-term application.

2.3 New data mining techniques. Descriptions of new data mining techniques

emerging from the data mining research community that may not have been previously

applied to any real problem, but which the authors believe should be considered for use

by the Earth science community. Papers in this group can include speculative ideas for

data mining techniques that may still be in the early development stage.

Papers should be typeset in a single-column format, in 12pt font, and for letter or A4

sized paper. Papers should not exceed 4 pages (not counting references), and should be

submitted in PDF format.

Please submit your paper to Elaine Dobinson, by email to [email protected],

with “NASA Data Mining Workshop Submission” in the subject line. Also indicate in

your email which topic area (1.1, 1.2, 2.1, 2.2, or 2.3) your paper addresses. Please note

that due to time constraints some papers will be selected for oral presentation while

others will be posters. All accepted submissions will be published as a NASA technical

report.

The submission deadline is 11:00 pm (PST), January 16, 2006. Notification of acceptance

will be no later than March 17, 2006.

Follow on opportunity:

This NASA Data Mining Workshop precedes Interface 2006, the 38th Symposium on the

Interface of Computing Science, Statistics and Applications, at the Westin, May 24-27,

2006. A joint reception is scheduled for Wednesday evening (May 24), and a special

Interface session will be devoted to the results of this NASA Data Mining Workshop.

NASA Data Mining Workshop participants are invited to attend Interface 2006 at the

Interface member’s registration rate. Please see:

http://www.galaxy.gmu.edu/Interface2006/i2006webpage.html

for more information.

Appendix 2: List of Attendees

The list of attendees can also be found at http://datamining.itsc.uah.edu/meeting06/attendees.html.

Name                      Institution
Faleh Alshameri           George Mason University
Jeanne Behnke             Goddard Space Flight Center
Shyam Boriah              University of Minnesota
Kirk Borne                George Mason University
Amy Braverman             Jet Propulsion Laboratory
Michael C. Burl           Jet Propulsion Laboratory
Yang Cai                  Carnegie Mellon University
Doina Caragea             Iowa State University
Becky Castano             Jet Propulsion Laboratory
Mete Celik                University of Minnesota
Yi Chao                   Jet Propulsion Laboratory
Noel Cressie              Ohio State University
Elaine Dobinson           Jet Propulsion Laboratory
Saso Dzeroski             Jozef Stefan Institute, Ljubljana, Slovenia
Mary Ann Esfandiari       Goddard Space Flight Center
Mark Friedl               Boston University
Ivan Galkin               University of Massachusetts Lowell
Michael Garay             Jet Propulsion Laboratory
Stephanie Granger         Jet Propulsion Laboratory
Robert Grossman           University of Illinois Chicago
Geoffrey M. Henebry       South Dakota State University
Tin Kam Ho                Bell Labs
Rie Honda                 Kochi University, Japan
Jason Hyon                Jet Propulsion Laboratory
Mehrdad Jahangiri         University of Southern California
Praveen Kumar             University of Illinois Urbana
S. Mark Leidner           Atmospheric and Environmental Research, Inc.
Francis Lindsay           NASA
Christopher S. Lynnes     Goddard Space Flight Center
Hal Maring                NASA
Jeff Masek                Goddard Space Flight Center
Zoran Obradovic           Temple University
Rahul Ramachandran        University of Alabama in Huntsville
Rob Raskin                Jet Propulsion Laboratory
Joseph Roden              Battelle
Rob Sherwood              Jet Propulsion Laboratory
Tao Shi                   Ohio State University
Ashok N. Srivastava       Ames Research Center
Ranga Raju Vatsavai       University of Minnesota
Jorge Vazquez             Jet Propulsion Laboratory
Kiri Wagstaff             Jet Propulsion Laboratory
Amy Walton                NASA
Ed Wegman                 George Mason University
Lisa Wilcox               Goddard Space Flight Center
Brian Wilson              Jet Propulsion Laboratory
Robert Wolfe              Goddard Space Flight Center
Jin Soung Yoo             University of Minnesota

Appendix 3: Final Agenda

The final agenda can also be found at http://datamining.itsc.uah.edu/meeting06/agenda.html.

Tuesday, May 23rd

8:00 - 8:30 Registration & Continental Breakfast

8:30 - 10:00 Session 1: Opening Talks, Session Chair: Elaine Dobinson; Session

Recorder: Jeanne Behnke

Welcome, logistics, workshop format & objectives [Slides] - Elaine Dobinson

NASA Role in Data Mining and Desired Workshop Results [Slides] - Francis Lindsay,

NASA HQ

Plans for the Evolution of EOSDIS [Slides] - Mary Ann Esfandiari, EOSDIS

Statistics, Data Mining, and Climate Change - Ed Wegman, George Mason University

10:00 - 10:30 Break

10:30 - 12:00 Session 2: Atmosphere 1, Session Chair: Chris Lynnes; Session Recorder: Stephanie Granger

Satellite Data: Massive but Sparse [Slides]

Speaker: Noel Cressie; Authors: Noel Cressie and Tao Shi; Ohio State University

A Hybrid Object-based/Pixel-based Classification Approach to Detect Geophysical Phenomena [Slides]

Speaker: Rahul Ramachandran; Authors: Xiang Li, Rahul Ramachandran, Sara Graves, and Sunil Movva; University

of Alabama in Huntsville

Essentials for Modern Data Analysis Systems [Slides]

Speaker: Mehrdad Jahangiri; Authors: Mehrdad Jahangiri, and Cyrus Shahabi; University of Southern California

12:00 - 1:30 Lunch

1:30 - 3:00 Session 3: Climate, Session Chair: Becky Castano; Session Recorder: Chris Lynnes

Knowledge Discovery From Global Remote Sensing and Climate Data: Results from Supervised and Unsupervised

Data Mining [Slides]

Speaker: Mark Friedl; Authors: Mark Friedl and Carla Brodley; Boston University

Characterizing Variability and Multi-resolution Predictions of Virtual Sensors [Slides]

Speaker: Ashok N. Srivastava; Authors: Ashok N. Srivastava and Rama Nemani; NASA Ames Research Center

The Application of Clustering to Earth Science Data: Progress and

Challenges [Slides]

Speaker: Shyam Boriah; Authors: Michael Steinbach, Pang-Ning Tan, Shyam Boriah, Vipin Kumar, Steven Klooster,

and Christopher Potter; University of Minnesota

3:00 - 3:30 Break

3:30 - 5:00 Session 4: Surfaces, Session Chair: Amy Walton; Session Recorder: Rebecca Castano

Spatiotemporal Data Mining for Monitoring Ocean Objects [Slides and Demo]

Speaker: Yang Cai; Authors: Yang Cai, Karl Fu, Daniel Chung, Richard Stumpf, Timothy Wynne, and Mitchell

Tomlison; Carnegie Mellon University

Temporal Modeling and Missing Data Estimation for MODIS Vegetation Data [Slides]

Speaker: Rie Honda; Author: Rie Honda; Kochi University, Japan

Using Land Surface Phenology for Spatio-temporal Mining of Image Time Series: A Manifesto [Slides]

Speaker: Geoffrey M. Henebry; Authors: Geoffrey M. Henebry and Kirsten M. deBeurs; South Dakota State

University

Recent HARVIST Results: Classifying Crops from Remote Sensing Data [Slides]

Speaker: Kiri Wagstaff; Authors: Kiri Wagstaff and Dominic Mazzoni; Jet Propulsion Laboratory

5:30 - 7:00 Poster Session & Reception

Automated Metadata for Image Mining

Presenter: Faleh Alshameri; Authors: Faleh Alshameri, and Ed Wegman; George Mason University

Clustering Spatio-Temporal Patterns using Levelwise Search

Presenter: Raj Bhatnagar; Authors: Abhishek Sharma and Raj Bhatnagar; University of Cincinnati

Automated Wildfire Detection Through Artificial Neural Networks [Slides]

Presenter: Kirk Borne; Authors: Kirk Borne (GMU), Jerry Miller (NASA), Brian Thomas (UMD), Zhenping Huang

(UMD), and Yuechen Chi (GMU); George Mason University

Sensory Stream Data Mining on Chip

Presenter: Yang Cai; Authors: Yang Cai and Yong X. Hu; Carnegie Mellon University

Knowledge Discovery from Disparate Earth Data Sources [Slides]

Presenter: Doina Caragea; Authors: Doina Caragea and Vasant Honavar; Iowa State University

Parameter Estimation for the Spatial Autoregression Model: A Rigorous Approach

Presenter: Mete Celik; Authors: Mete Celik, Baris M. Kazar, Shashi Shekhar, and Daniel Boley; University of

Minnesota

Intelligent Archive Technologies for NASA/IMAGE Radio Plasma Imager Data [Slides]

Presenter: Ivan Galkin; Authors: I. Galkin, G. Khmyrov, A. Kozlov, B.W. Reinisch, R. Benson, and S. Fung; U Mass

Lowell

Predicting Forest Stand Height and Canopy Cover from LANDSAT and LIDAR Data Using Decision Trees

Presenter: Saso Dzeroski; Authors: Saso Dzeroski, Andrej Kobler, Valentin Gjorgjioski, and Pance Panov; Jozef

Stefan Institute, Ljubljana, Slovenia

Selection Technique for Thinning Satellite Data for Numerical Weather Prediction [Slides]

Presenter: Mark Leidner; Authors: Christian Alcala, Ross N. Hoffman, and S. Mark Leidner; Atmospheric and

Environmental Research, Inc.

Show and Tell: A Seamlessly Integrated Tool for Searching with Image Content and Text

Presenter: Olfa Nasraoui; Authors: Zhiyong Zhang, Carlos Rojas, Olfa Nasraoui, Hichem Frigui; University of

Louisville

Asynchronous Data Mining Tools at the GES-DISC

Presenter: Christopher S. Lynnes; Authors: Long B. Pham, Stephen W. Berrick, Christopher S. Lynnes and Eunice K.

Eng; NASA - GES DAAC, Goddard Space Flight Center

Miner: A Suite of Classifiers for Spatial, Temporal, Ancillary, and Remote Sensing Data Mining

Presenter: Ranga Raju Vatsavai; Authors: Ranga Raju Vatsavai and Shashi Shekhar; University of Minnesota

The LBA-ECO Metadata Warehouse and Its Implications for Data Mining Initiatives

Presenter: Lisa Wilcox; Authors: Lisa Wilcox, Amy L. Morrell, and Peter C. Griffith; NASA Goddard Space Flight

Center

Data Mining Via Smart Grid Workflow: The SciFlo Dataflow Execution Network

Presenter: Brian Wilson; Authors: Brian Wilson, Dominic Mazzoni, Gerald Manipon, and Benyang Tang; Jet

Propulsion Laboratory

A Framework for Mining Co-evolving Spatial Events

Presenter: Jin Soung Yoo; Authors: Jin Soung Yoo and Shashi Shekhar; University of Minnesota

Wednesday, May 24th

8:00 - 8:30 Continental Breakfast

8:30 - 10:00 Session 5: Land Cover, Session Chair: Mike Burl; Session Recorder: Rahul Ramachandran

Multiscale Analysis Of Data: Clusters, Outliers and Noise - Preliminary Results

Speaker: Robert Grossman; Authors: Chetan Gupta and Robert Grossman; University of Illinois Chicago

Unraveling the Dominant Influences on the Evolution of Land-Surface Variables using Data Mining [Slides]

Speaker: Praveen Kumar; Authors: Praveen Kumar, Peter Bajcsy, Amanda B. White, Vikas Mehra, David Tcheng,

David Clutter, Wei-Wen Feng, Pratyush Sinha, and Richard Robertson; University of Illinois Urbana

Adopting Semi-supervised Learning Algorithms for Mining Remote Sensing Imagery: Summary of Results and Open

Research Problems [Slides]

Speaker: Ranga Raju Vatsavai; Authors: Ranga Raju Vatsavai, Shashi Shekhar, and Thomas E. Burk; University of

Minnesota

10:00 - 10:30 Break

10:30 - 12:00 Session 6: Atmosphere 2, Session Chair: Rahul Ramachandran; Session Recorder: Brian Wilson

An Operational Pixel Classifier for the Multi-angle Imaging SpectroRadiometer (MISR) Using Support Vector

Machines [Slides]

Speaker: Michael Garay; Authors: Dominic Mazzoni, Michael Garay, and Roger Davies; Jet Propulsion Laboratory

Data Mining Support for Aerosol Retrieval and Analysis – Project Summary

Speaker: Zoran Obradovic; Authors: Zoran Obradovic, Bo Han, Qifang Xu, Yong Li, Amy Braverman, Zhanqing Li,

and Slobodan Vucetic; Temple University

Polar Cloud Detection using MISR and MODIS Data [Slides]

Speaker: Tao Shi; Authors: Tao Shi, Bin Yu, Eugene E. Clothiaux and Amy J. Braverman; Ohio State

12:00 - 1:30 Lunch

1:30 - 3:00 Session 7: Application of Data Mining and Statistics to Earth Science Research, Session Chair and Panel

Moderator: Amy Braverman; Session Recorders: Chris Lynnes, Stephanie Granger

NASA's Approach to Earth Science [Slides] - Hal Maring, Radiation Sciences Program, Science Mission Directorate,

NASA

Panel Discussion - How to promote the infusion of data mining and statistical technologies into Earth science

Panel Members: Sara Graves, Francis Lindsey, Jeffrey Masek, Hal Maring, Ed Wegman, and Amy Walton

3:00 Workshop Adjourns – Report Writing Begins (Program Committee, Panel Members, & Volunteers)

Appendix 4: Summary of Workshop Presentations

Submitted abstracts and workshop presentations are available at

http://datamining.itsc.uah.edu/meeting06/agenda.html.

A total of 37 papers were submitted to the workshop, with 16 selected for

presentation and 15 as posters. The papers selected for presentation covered a wide

spectrum of scientific data mining applications relevant to NASA’s Earth science

mission. The papers covered different mining techniques, problems and themes within

Earth science. Consequently, technical presentations for the workshop were organized

based on science themes: Atmosphere, Climate, Surfaces and Land Cover.

Session 2: Atmosphere 1

N. Cressie described the problem of sparseness in massive satellite datasets and presented

an optimal statistical method for smoothing these massive, sparse datasets. His work

with Tao Shi focused on creating a Multi-angle Imaging SpectroRadiometer (MISR) level

3 aerosol optical depth product using Fixed Rank Kriging (FRK). Traditional kriging

methods require inversion of covariance matrices and can be computationally expensive.

FRK was presented as an efficient smoothing method that considerably reduced the mean

squared prediction errors as compared to other spatial smoothing methods. His

presentation emphasized the importance of using appropriate statistical techniques (e.g.,

covariance-function estimation) during the data preparation step, before carrying out

optimal smoothing. The importance of handling uncertainty during the preparation and

analysis was highlighted.
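The computational point behind a fixed-rank approach can be illustrated with a minimal sketch. It assumes the data covariance has the form S K Sᵀ + σ²I, with S an n × r matrix of basis functions and r fixed and small, so that the Sherman-Morrison-Woodbury identity replaces the n × n inversion with an r × r solve. The Gaussian basis, the identity matrix used for K, the noise variance, and the synthetic data are all illustrative assumptions, not the settings used in the presented work (where K would be estimated from data).

```python
import numpy as np

def frk_predict(S_obs, S_new, K, sigma2, z):
    """Simple-kriging predictor under the covariance S K S^T + sigma2 * I.

    The Sherman-Morrison-Woodbury identity means only an r x r system is
    solved, where r is the fixed number of basis functions.
    """
    # r x r matrix: K^{-1} + S^T S / sigma2
    inner = np.linalg.inv(K) + S_obs.T @ S_obs / sigma2
    # Sigma^{-1} z, computed without forming the n x n matrix Sigma
    Stz = S_obs.T @ z
    Sigma_inv_z = z / sigma2 - S_obs @ np.linalg.solve(inner, Stz) / sigma2**2
    return S_new @ K @ (S_obs.T @ Sigma_inv_z)

def gaussian_basis(x, centers, width):
    """Hypothetical basis: Gaussian bumps at fixed centers (1-D locations)."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

# Synthetic 1-D stand-in for sparse, noisy satellite observations.
rng = np.random.default_rng(0)
x_obs = rng.uniform(0.0, 10.0, size=200)
x_new = np.linspace(0.0, 10.0, 5)
centers = np.linspace(0.0, 10.0, 6)
S_obs = gaussian_basis(x_obs, centers, 2.0)
S_new = gaussian_basis(x_new, centers, 2.0)
K = np.eye(6)      # assumed; a real analysis estimates K from the data
sigma2 = 0.25      # assumed measurement-error variance
z = np.sin(x_obs) + rng.normal(0.0, 0.5, size=200)

fast = frk_predict(S_obs, S_new, K, sigma2, z)

# Reference: the direct n x n solve that the identity avoids.
Sigma = S_obs @ K @ S_obs.T + sigma2 * np.eye(len(z))
direct = S_new @ K @ S_obs.T @ np.linalg.solve(Sigma, z)
```

Both routes give the same predictions; the difference is that the cost of the first grows linearly with the number of observations for fixed r, while the direct solve grows cubically.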

R. Ramachandran described a collaborative mining project focused on detecting

geophysical phenomena, specifically fronts, in numerical model output. The science goal

for this project was to create a climatology of targets that could be used for further

analysis, including model validation and verification. A hybrid methodology that

combined both pixel level and object level mining was presented. An unsupervised

clustering algorithm was used to perform soft classification to identify possible frontal

pixels. A hierarchical thresholding technique was then used in conjunction with a Bayes

classifier to detect different regions as fronts. The presentation highlighted the

importance of involving domain experts during the mining process. In particular,

the domain experts helped during the data preparation step and in creating the “truth”

data.

M. Jahangiri presented a data storage and retrieval system that provides scientists the

functionality of performing complex statistical queries. The system is specifically

designed to handle very large multidimensional datasets such as those produced by the

Atmospheric Infrared Sounder (AIRS) instrument. The querying capability of this

system is built on discrete wavelet transforms to provide approximate answers with

progressively increasing accuracy to the users. The basic idea is to provide fast queries

by converting the data to wavelet coefficients then querying regions and reconstructing

the results. These queries work on multiple resolutions, thus a user is able to summarize

at various levels of abstraction on the fly. One of the obstacles faced by the authors while

designing and developing the system was the sparseness of the information as compared

to the file size.
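The idea of answering a query from wavelet coefficients with progressively increasing accuracy can be sketched as follows. This is a minimal illustration using an unnormalized one-dimensional Haar transform and a coarse-to-fine coefficient ordering; the function names, the choice of Haar wavelets, and the reconstruct-then-sum query are assumptions made for the example and are not taken from the presented system.

```python
def haar_forward(values):
    """Unnormalized Haar transform of a length-2^k list.

    Returns [overall mean, coarsest detail, ..., finest details].
    """
    data = list(values)
    details = []
    while len(data) > 1:
        avgs = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        diffs = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        details = diffs + details  # finer details end up later in the list
        data = avgs
    return data + details

def haar_inverse(coeffs):
    """Invert haar_forward by expanding one resolution level at a time."""
    data = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(data)]
        data = [v for a, d in zip(data, diffs) for v in (a + d, a - d)]
        pos += len(diffs)
    return data

def approximate_range_sum(coeffs, lo, hi, n_coeffs):
    """Range sum over [lo, hi) using only the first n_coeffs coefficients."""
    kept = coeffs[:n_coeffs] + [0.0] * (len(coeffs) - n_coeffs)
    return sum(haar_inverse(kept)[lo:hi])

signal = [3, 1, 4, 1, 5, 9, 2, 6]
coeffs = haar_forward(signal)

# Successive answers to the same query as more coefficients are used.
answers = [approximate_range_sum(coeffs, 1, 6, n) for n in (1, 2, 4, 8)]
```

Using all eight coefficients reproduces the exact range sum; the earlier entries in `answers` are coarser estimates available after reading only part of the coefficient set.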

Session 3: Climate

M. Friedl presented results and lessons learned from mining large volume and high

dimensional EOS datasets. The mining techniques applied in his research applications

covered both supervised and unsupervised classifiers. A supervised classification

technique, specifically a decision tree, was used to create a global land cover model using

Moderate Resolution Imaging Spectroradiometer (MODIS) datasets. Unsupervised

methods such as Independent Component Analysis and Canonical Correlations were used

to identify spatio-temporal patterns of joint ecosystem-climate variability at global scales

using gridded climate and remote sensing data sets. Even though these were successful

applications, a number of lessons pertaining to scientific data mining were learned. The

data, not the technique, are of paramount importance in supervised learning.

The data dictate issues like unbalanced training, proper feature selection, normalization,

etc. The use of unsupervised learning techniques needs to be driven by a science

hypothesis; otherwise the whole analysis process is in danger of being deemed a “fishing

expedition.” The presenter emphasized the lack of mature data mining toolkits that are

capable of handling large data sets and the need to have collaborative teams including

both data analysis experts and domain scientists to prevent scientists from doing naïve

analysis and data analysis experts from doing naïve science.

A. Srivastava presented the concept of a Virtual Sensor. This concept is based on the

assumption that there are potentially nonlinear relationships between spectra, and that one

can reconstruct a signal given available redundant data. Data mining techniques can be

used to learn these nonlinear relations for the reconstruction. Once learned, these

relationships can be used to backcast and make multi-resolution predictions. Backcasting

involves estimating the value of unmeasured spectra given other measured spectral

components. Multi-resolution prediction entails estimating high spatial resolution spectra

based on relationships learned from low resolution measurements. Kernel methods were

used for learning and the results based on the initial experiments support the concept of a

Virtual Sensor.
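The backcasting step can be illustrated with a small kernel regression example. The sketch below uses kernel ridge regression with a Gaussian (RBF) kernel as a stand-in for the kernel methods mentioned; the synthetic "bands", the kernel width `gamma`, and the ridge parameter `lam` are all assumptions chosen for illustration and do not come from the presented experiments.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_ridge(X, y, gamma, lam):
    """Dual coefficients alpha solving (K + lam * I) alpha = y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, gamma):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(1)
# Synthetic stand-in: two "measured" bands and one "unmeasured" band that is
# a nonlinear function of them.
measured = rng.uniform(0.0, 1.0, size=(300, 2))
unmeasured = np.sin(3.0 * measured[:, 0]) * measured[:, 1]

train, test = slice(0, 200), slice(200, 300)
alpha = fit_kernel_ridge(measured[train], unmeasured[train], gamma=5.0, lam=1e-3)
estimate = predict(measured[train], alpha, measured[test], gamma=5.0)
rmse = float(np.sqrt(np.mean((estimate - unmeasured[test]) ** 2)))
```

The relationship is learned where both the measured and the target bands are available, then applied where only the measured bands exist, which is the essence of estimating an unmeasured spectral component.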

Application of clustering to Earth science data was presented by S. Boriah. Clustering

techniques were applied to discover climate indices. The detection of climate indices is

important in order to find relationships with other climate phenomena such as El Niño. A

new density based clustering technique called Shared Nearest Neighbor (SNN) was

designed for this application. The SNN technique did a better job of detecting the known

climate indices, compared to traditional approaches that used Singular Value

Decomposition. The presentation also delved into challenges of clustering Earth

science data. These challenges arise from the fact that Earth science phenomena evolve

in both space and time. Consequently, there is a need to develop new clustering

algorithms that can find dynamic clusters. These new algorithms need to handle the

inherent spatiotemporal information embedded in the Earth science datasets.
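The similarity measure at the core of shared-nearest-neighbor clustering can be sketched briefly. The example below covers only that measure, not the full density-based clustering procedure; it uses Euclidean distance on toy 2-D points, whereas a climate application would more plausibly compare time series at grid cells. The point coordinates and the value of k are illustrative assumptions.

```python
from math import dist

def knn_lists(points, k):
    """Indices of the k nearest neighbors of each point (excluding itself)."""
    neighbors = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: dist(p, points[j]))
        neighbors.append(set(order[:k]))
    return neighbors

def snn_similarity(neighbors, i, j):
    """Shared-neighbor count; nonzero only if i and j list each other."""
    if i in neighbors[j] and j in neighbors[i]:
        return len(neighbors[i] & neighbors[j])
    return 0

# A tight group (points 0-3) and a group four times more spread out (4-7).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 14), (14, 10), (14, 14)]
neighbors = knn_lists(points, k=3)

tight = snn_similarity(neighbors, 0, 1)    # within the tight group
loose = snn_similarity(neighbors, 4, 5)    # within the spread-out group
across = snn_similarity(neighbors, 0, 4)   # between the groups
```

Because the similarity counts shared neighbors rather than raw distance, pairs within the tight group and pairs within the spread-out group receive the same score, which is what makes the measure usable when cluster densities differ.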

Session 4: Surfaces

A NASA/ESTO funded project on spatiotemporal data mining for monitoring ocean

objects was presented by Y. Cai. The case study was based on detecting and predicting

algae blooms. The components required to achieve this objective include object

segmentation and tracking, and the use of mining techniques to make predictions. A

spatial density clustering technique was used in conjunction with a convex hull to

segment the interesting regions and ignore regions with missing data. A shape correlation

filter based on the Fast Fourier Transform was used to track the object in time. A neural

network was trained to make spatiotemporal predictions. The input to the neural network

included a set of historical data of the object and additional physical parameters such as

wind and temperature. The neural network thus predicted the spatiotemporal evolution of

the object. The use of cellular automata was also explored to create simulations.

Coupling the multi-physics models into the data mining models was one of the

challenges presented to the audience. The presentation emphasized the need to have a key

domain expert involved both in the data mining project and in developing data mining

techniques.
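Tracking by FFT-based correlation can be illustrated with a minimal example. The sketch below estimates the displacement of an object between two frames from the peak of their circular cross-correlation, computed in the Fourier domain. It is a plain cross-correlation, used here as a simplified stand-in for the shape correlation filter described; the frame size and the rectangular "object" are assumptions made for the example.

```python
import numpy as np

def estimate_shift(prev_frame, next_frame):
    """Return (dy, dx) that moves the object in prev_frame onto next_frame.

    The circular cross-correlation is computed with 2-D FFTs and the
    location of its peak is converted to a signed shift.
    """
    F_prev = np.fft.fft2(prev_frame)
    F_next = np.fft.fft2(next_frame)
    corr = np.fft.ifft2(F_next * np.conj(F_prev)).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peak indices past the midpoint correspond to negative shifts.
    return tuple(int(p) if p <= n // 2 else int(p) - n
                 for p, n in zip(peak, corr.shape))

# Synthetic frames: a rectangular patch that moves 2 rows down, 3 columns left.
prev_frame = np.zeros((32, 32))
prev_frame[10:14, 8:13] = 1.0
next_frame = np.roll(prev_frame, shift=(2, -3), axis=(0, 1))

shift = estimate_shift(prev_frame, next_frame)
```

Repeating this estimate frame by frame yields a track of the object's position over time, which could then feed a predictive model together with other physical parameters.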

R. Honda presented a methodology for temporal modeling and missing data estimation

for MODIS vegetation data. The common obstacles for spatiotemporal mining from EOS

satellite data are sparseness, noise and missing data. The methodology presented used a

Maximum a Posteriori (MAP) approach to fit a temporal model. The MAP estimate was

then combined with spatial information using an ensemble of regression trees (Random

Forests). The results presented showed that the temporal model built using MAP did

better than the piecewise logistic function model and did not require any additional

preprocessing of the data, such as smoothing or segmentation.

An exposition of issues for spatio-temporal mining of image time series was presented by

G. Henebry. These issues centered around synoptic ecology, which focuses on the

interface between the land surface and the atmosphere and needs to

distinguish unusual changes from expected variation. Three questions

fundamental to this issue were presented. The first asked what the appropriate unit of analysis is. It was

posited that the scale depends on the question at hand but is certainly not the “pixel”

level. The second question asked what baseline should be used to make

comparisons. The presenter postulated that current statistical techniques for modeling

spatio-temporal data focus on the images. However, what is of scientific interest

is not the images themselves, but rather the processes or patterns they portray. The author

presented a simple analogy to explain this point. One could create a baseline for a movie

by sampling a set of frames, but a better baseline would be the one that captures the plot

of the movie. The third question centered on conducting change analysis. Five steps to

conduct such analysis were presented: detection, quantification, assessment, attribution

and consequences.

K. Wagstaff presented results from the Heterogeneous Agricultural Research Via

Interactive, Scalable Technology (HARVIST) project. The goal of the project is to

develop a standard toolkit to enable interactive analysis of relationships among multiple

science data products, and provide capabilities such as clustering, classification, and

statistical analysis for prediction. Classification and prediction of crop yield was the

specific science application used to demonstrate the tool. Ground truth was used to train

a Support Vector Machine (SVM) classifier for MISR satellite data. The ground truth

consisted of measurements made for ten crops at 99 different sites in California. The

presenter mentioned that the interactive nature of the tool was appealing to the scientists and

that one-on-one, in-person interaction was much more effective in establishing a

working relationship between the scientists and data miners, as compared to email,

phone calls or the exchange of papers.

Session 5: Land Cover

R. Grossman presented a multi-scale analysis of large, distributed data sets in resource-

constrained environments. Three challenges in resource constrained data mining were

presented and possible solutions were advocated by the means of case studies. Accessing

large volumes of data was the first challenge. A new application level network protocol

was put forth as a possible solution to exploit high bandwidth to promote distributed

mining. The second challenge focused on modeling heterogeneous data. An algorithm

based on hierarchical decomposition was presented. The algorithm allows processing at

different scales (levels of decomposition). The third challenge dealt with deployment

issues in decision support systems for real world uses. A traffic alert system that uses

Doppler radar data in real time was presented as an example.

Use of data mining techniques to discover dominant influences on the evolution of land

surface variables such as vegetation indices, temperature, and emissivity was presented

by P. Kumar. The dependence of these surface variables on invariant factors such as

topography, soil characteristics and land cover type was also explored. A regression tree

induction was used to model the dependence of greenness on different parameters using

MODIS data. The rules generated by the tree induction algorithm were then analyzed to

discover the dominant influences. The presentation described the data mining system

(GeoLearn) that combines an Image2Knowledge (I2K), Data2Knowledge (D2K), and an

ArcGIS engine. The challenge for such a system is handling the heterogeneity of

formats, scale, resolution, sampling, etc. of the Earth science datasets. The key theme

highlighted during this presentation was that a successful data mining application is a

cyclically coupled process between data, science drivers, and the mining technology.

R. R. Vatsavai presented the use of semi-supervised learning algorithms for mining

remote sensing data. Acquiring labels or truth data for supervised classification is a

costly, time consuming and labor intensive process. It can be error prone or subjective,

as in the case where different experts classify a region differently. Semi-supervised

classification attempts to learn using partially labeled training data. An algorithm for

semi-supervised classification and experimental results from the algorithm were

presented. One of the interesting suggestions made during this presentation was the

creation of benchmark datasets. There are well-known benchmark datasets archived at U.

of California, Irvine that are used by the mining community to develop new algorithms.

It was suggested that NASA create and provide a set of benchmark Earth science

datasets to the mining community to foster the development of new algorithms geared

towards geoscience data.

Session 6: Atmosphere 2

M. Garay presented the use of Support Vector Machines (SVM) with data from the

Multi-angle Imaging SpectroRadiometer (MISR) to classify clouds. Cloud detection and

screening are important steps towards achieving the MISR science objectives. However,

the current, physically derived, MISR cloud masks have limitations. Consequently, an

SVM was used to create a cloud mask, detect thin cirrus, and perform multiclass

classification. The truth data used to train the classifier was generated by two

independent experts. The SVM classifier produced good results and the code is being run

at the Langley DAAC as a part of the MISR standard processing. SVMs provide a number

of advantages compared to other classification techniques. They tend to balance accuracy

and the ability to generalize. However, the inability to explain the underlying physical

reasons for the SVM success limits their acceptance by the broader Earth science

community.

Results from a study that applied data mining techniques for Aerosol Optical Thickness

(AOT) retrieval were presented by Z. Obradovic. First, a neural network was used to

analyze the influence of different attributes on the AOT retrieval accuracy. Then, a

decision tree was employed to analyze the conditions where the MISR or the MODIS

AOT retrievals could be improved by using neural network predictions. Interesting rules

were discovered such as: in presence of clouds, the neural network predictions

outperform MISR AOT retrievals at 72% of locations; while in desert regions, under

certain conditions, neural network predictions can improve MODIS retrievals at 80% of

the locations. Finally, the MODIS and MISR were used to generate combined AOT

retrievals, with improved accuracy.

T. Shi presented a methodology for improving polar cloud detection by fusing MISR and

MODIS data. First, an experimental MISR polar cloud detection algorithm was

developed. This algorithm, Enhanced Linear Correlation Matching (ELCM), used three

MISR angular signatures to detect polar clouds. A consensus mask was then created by

fusing both the MISR experimental and the MODIS operational cloud masks. The

consensus mask was used to train a classifier for MISR data. A Quadratic Discriminant

Analysis classifier was used. This classifier models each class density as a multivariate

Gaussian distribution. The use of fused data for training gave better results. The

effective collaboration between scientists and statisticians was again emphasized as

crucial for the success of the project.
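A classifier of the kind described, one multivariate Gaussian per class with its own covariance, can be sketched in a few lines. The example below is a generic quadratic discriminant analysis on synthetic two-feature data; the class labels (0 for clear, 1 for cloud), the feature distributions, and the sample sizes are assumptions for illustration and do not represent MISR features or the consensus-mask training set.

```python
import numpy as np

def fit_qda(X, y):
    """Per-class prior, mean vector, and covariance matrix."""
    model = {}
    for label in np.unique(y):
        Xc = X[y == label]
        model[int(label)] = (len(Xc) / len(X), Xc.mean(axis=0),
                             np.cov(Xc, rowvar=False))
    return model

def predict_qda(model, X):
    """Assign each row of X to the class with the highest Gaussian log-posterior."""
    labels = sorted(model)
    scores = []
    for label in labels:
        prior, mean, cov = model[label]
        diff = X - mean
        # Squared Mahalanobis distance of each row to the class mean
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        _, logdet = np.linalg.slogdet(cov)
        scores.append(np.log(prior) - 0.5 * logdet - 0.5 * maha)
    return np.array(labels)[np.argmax(np.vstack(scores), axis=0)]

def sample(rng, n):
    """Synthetic features: a compact 'clear' class and a broader 'cloud' class."""
    clear = rng.normal([0.0, 0.0], 0.5, size=(n, 2))
    cloud = rng.normal([3.0, 3.0], 1.5, size=(n, 2))
    return np.vstack([clear, cloud]), np.repeat([0, 1], n)

rng = np.random.default_rng(2)
X_train, y_train = sample(rng, 200)
X_test, y_test = sample(rng, 200)
model = fit_qda(X_train, y_train)
accuracy = float(np.mean(predict_qda(model, X_test) == y_test))
```

Because each class keeps its own covariance, the decision boundary is quadratic rather than linear, which matters when one class is much more variable than the other, as in this synthetic case.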

Session 7: Application of Data Mining and Statistics to Earth Science (Panel Discussion)

Hal Maring, Program Scientist and Manager for the Radiation Sciences Program, gave a

science manager’s perspective on science research in NASA. He stressed that NASA

does science that is policy relevant and policy driven. The whole research effort is to

build a comprehensive Earth System model where the observations are used to improve

process models, and the process models are coupled together into decision support

systems that can be used by policy managers. There are challenges in utilizing the data

properly because of disparities in space, time, parameter, uncertainty, source, data type,

etc. There are other challenges in fundamental climate research, such as characterization

of uncertainty, setting climate observation requirements and priorities. The

characterization of uncertainty of the models is essential if these models are going to be

used to make predictions about the future.

Amy Walton described different funding possibilities in ESTO for projects that apply

data mining techniques to Earth science problems. ESTO seeks cutting edge, but

science-driven, technologies. The proposals have to have a clear connection between

innovative technology and science needs. The proposals are peer reviewed by both the

science and the technology community to provide a balance.

Jeff Masek summarized a scientist’s views on data mining after listening to all the

technical presentations. He argued that conventional science follows the path where

observations feed models that generate predictions. These predictions are compared with

observations for validation. The models used are built on physics. Data mining

circumvents the traditional path as the observations themselves are used to build the

model and generate predictions. Data mining can be critical for success in science

problems where the physics is not well understood or the models are too complex or the

relationships between parameters cannot be conceptualized by scientists. Some

possible applications in terrestrial ecology could include finding the relationship between

vegetation dynamics and climate; prediction of the future trajectories of land cover

changes such as urbanization, deforestation; and the intelligent assimilation of land plot

data (biodiversity, flux towers) with remote sensing data. His presentation stressed that

data miners should look for science problems where the physical models fail or disagree.

The need for a tighter collaboration between scientists, statisticians and data miners was

again emphasized. These collaborative teams should focus on the physical process model

and the interpretation of the mined discovery. Mining applications should also focus on

datasets that are high dimensional and high volume, such as the entire MODIS dataset,

NCAR climate records, etc.

Poster Session

The poster presentations were given during a reception following Session 4. There were

15 posters selected for the workshop. However, two were not presented due to last

minute illnesses.

There were five posters submitted on data mining tools. I. Galkin presented a poster on

intelligent archive technologies for NASA/IMAGE Radio Plasma Imager data. C. Lynnes

presented data mining tools at the GES-DISC that allow batch mode processing on large

volumes of data. A suite of classifiers for remote sensing data was presented by R. R.

Vatsavai. SciFlo, a tool that allows users to compose and execute mining workflows on

the grid, was presented by B. Wilson. A tool by O. Nasraoui for searching images based

on content rather than metadata was chosen for presentation, but unfortunately the

speaker could not attend.

Three posters focused on different applications of data mining to science problems. M.

Leidner demonstrated the use of wavelets to thin satellite data for numerical weather

prediction. S. Dzeroski presented a study that used a decision tree classifier to predict

forest stand height and canopy cover using LANDSAT and LIDAR data. An application

of neural networks for wildfire detection was presented by K. Borne.

There were two posters that focused on metadata related issues. L. Wilcox presented the

LBA-ECO metadata warehouse and its implications for data mining. A method to

augment existing metadata using automated methods was presented by F. Alshameri.

In addition, there were posters that covered other interesting topics. An FPGA based

reconfigurable sensory stream data mining processor was presented by Y. Cai. A general

strategy that allows algorithms to learn from semantically heterogeneous data sources and

enable knowledge discovery was presented by D. Caragea. M. Celik presented a faster,

scalable version of a spatial autoregression model (NORTHSTAR) for spatial data

analysis. J. S. Yoo presented a mathematical framework for mining co-evolving spatial

events. An algorithm for mining spatio-temporal patterns was to have been presented by

R. Bhatnagar, but unfortunately the speaker was ill and unable to attend.

Appendix 5: Current and Emerging Technology Themes

In alignment with NASA’s Earth Science program, the workshop sessions were

organized around science discipline. However, as the workshop proceeded, it was clear

that: 1) within a given science discipline, a broad variety of data mining techniques could

be applied and 2) across different science disciplines, the same data mining technique

might be used (see footnote 4). Thus, reorganizing (or reinterpreting) the workshop content according to

technology theme is a potentially valuable exercise for gaining insight into the current

state-of-the-practice and predicting directions for future work.

Table 5.1 takes this exercise a step further and categorizes the workshop papers both by

science focus area and technology theme. The rows of the table correspond to technology

theme and the columns correspond to Earth science focus area as defined at

http://science.hq.nasa.gov.earth-sun/science/. An additional column, denoted by a “*”,

was added to capture papers that were not closely tied to a particular science focus area.

In some cases, papers were assigned both to a specific discipline and to the “*” category,

e.g., a paper that proposed a general method that could clearly be applied in multiple

focus areas, but was evaluated within the context of a specific focus area. Also, note that

the rows of the table are grouped into three main divisions: process model (red), data

model (green), and infrastructure/tools (blue).

Current State-of-the-Practice

From Table 5.1, several points are clear. First, a significant number of papers involve

infrastructure activities and end-user tools that are not tied to any specific science focus

area. Infrastructure activities include, for example, data management methods for

organizing, storing, querying, and transferring large volumes of data, creating and

exploiting ontologies and metadata, and methods for parallel and distributed data mining.

End-user tools allow for retrieval, visualization, and other interactive operations with

large datasets. Although the initial reaction may be that technologies should be more

tightly coupled to science focus areas, it must be acknowledged that there is indeed much

in common across focus areas (especially at this low level of data manipulation and

processing); hence, factoring the commonalities into infrastructure that can support

multiple focus areas is reasonable. Researchers pursuing such efforts are cautioned,

however, that connecting their efforts back to the science datasets and to the needs of the

scientists themselves is critical to ensure useful products emerge. A tool without a user or

use is likely a wasted effort.

Another observation from Table 5.1 is that the dominant application of data

mining and statistics to Earth science data involves the use of supervised learning

techniques for land cover classification (under the Carbon & Ecosystems column) with

such applications now becoming fairly mature. In fact, very similar methods have been

applied to cloud classification and incorporated into the Langley DAAC for routine

MISR data processing.

4 As a concrete example of the latter, support vector machines (SVMs) were used both for land cover

classification and atmospheric cloud classification.

Column headings (science focus areas), left to right: Atmospheric Composition; Carbon & Ecosystems; Climate; Surface & Interior (see footnote 5); Water; Weather; *

Rows (technology themes), with the non-empty cell entries of each row listed left to right and separated by " | ":

Physics-based process model: 14, 30 | 7

Probability-based process model: 1 | 1

Change, anomaly, novelty: 9, 27

Clustering: 4, 7, 10, 22, 26 | 2, 4, 6 | 6 | 2 | 12

Discrete Events: 17 | 7, 26, 13 | 2, 25 | 2

Focus of attention, search: 17 | 11 | 2 | 20

Fusion: 30, 31 | 7, 10, 13, 15, 18, 22 | 15 | 19 | 5, 20

Missing data: 1 | 4, 8, 18, 28 | 4 | 1, 5, 28

Multires., wavelets: 1, 3 | 7, 26 | 19 | 1

Object detection, segmentation: 7, 13

Prediction (classification & regression): 1, 14, 30, 31 | 4, 7, 8, 10, 13, 16, 18, 22, 28 | 2, 4, 29 | 29 | 2, 29 | 1, 5

Spatial interp., spatial context: 1, 30 | 16, 22, 28 | 25 | 1, 28

Spatio-temporal pattern ident.: 4, 9, 27 | 2, 4, 6 | 6 | 2 | 12

Tracking: 7 | 2, 6 | 6 | 2

Uncertainty: 1 | 1, 5

Infrastructure: 3, 14, 17 | 15, 23, 26 | 15 | 3, 14, 15, 21, 24

Tools for end-user: 3, 17 | 10, 13, 22, 27 | 20, 21

Table 5.1. Approximate classification of workshop papers by technology theme (row) and science

focus area (column). The column headings denote the six Earth science focus areas defined by NASA,

plus an additional column, designated “*”, which is intended to capture papers that were not closely

tied to a particular focus area. The row headings indicate technology themes or approaches. Note

that the rows of the table are grouped into three main divisions: process model (red), data model

(green), and infrastructure/tools (blue). The cell entries are references to particular papers (see key

in footnote 6 below).

5 The Surface and Interior Structure focus area emphasizes natural hazards such as earthquakes, landslides,

erosion, floods, and volcanic eruptions.

6 Author key for Table 5.1: (1) Cressie, (2) Ramachandran, (3) Jahangiri, (4) Friedl, (5) Srivastava, (6)

Boriah, (7) Cai/Fu, (8) Honda, (9) Henebry, (10) Wagstaff, (11) Alshameri, (12) Bhatnagar, (13) Borne,

(14) Cai, (15) Caragea, (16) Celik, (17) Galkin, (18) Dzeroski, (19) Hoffman, (20) Nasraoui, (21) Lynnes,

(22) Vatsavai/Shekhar/Burk, (23) Wilcox, (24) Wilson, (25) Yoo, (26) Grossman, (27) P. Kumar, (28)

Vatsavai/Shekhar, (29) Garay, (30) Obradovic, (31) T. Shi.

Another popular application for data mining and statistics involves the use of clustering

techniques for spatio-temporal pattern identification, usually at a global or regional scale

with most of the activity in the Climate and Carbon/Ecosystems focus areas. This activity

likely reflects the broad interest in climate change and its impact on and interactions with

ecosystems, as well as the suitability of EOS data for doing these types of studies.

In contrast, we note that certain science focus areas are under-represented. In particular,

the Surface (Hazards) and Interior Structure, Water, and Weather areas had three or fewer

papers each. It is not clear whether this reflects lack of suitable data or algorithms for

addressing the science problems in these areas, or is merely an artifact (e.g., of the

workshop advertising or agency funding priorities).

The growing importance of data fusion is also illustrated by the table. Nearly a dozen

papers involve some form of data fusion. Joint use of MISR and MODIS data was

particularly common, but other applications involved the fusion of some combination of

satellite, airborne, and in-situ sensors. The recent NASA ROSES AIST call, which

emphasized the development of Sensor Web technologies, is likely to drive further work

in data fusion.

If we drill down into the individual papers, we see that a wide variety of techniques are

currently being applied to Earth science datasets, including:

• Canonical Correlation Analysis

• Classifiers based on parametric class-conditional models (Bayes classifiers with

Gaussian models)

• Clustering (k-means, spatial density clustering, Shared Nearest-Neighbor)

• Correlation-based object tracking, particle filter tracking

• Decision and Regression Trees including Random Forests (ensembles of

Regression Trees)

• Expectation-Maximization (semi-supervised learning)

• Fixed-rank kriging (FRK) for spatial interpolation/prediction

• Kernel methods (Support Vector Machines (SVM), Support Vector Regression)

• Markov Random Fields to model spatial context

• Neural Networks (feedforward neural nets, RBF, SOM, ensembles of neural nets)

• Principal Components Analysis (PCA), Independent Components Analysis (ICA)

• Spatial Autoregression Models

• Wavelets

Most of these techniques have been around for some time now (typically ten years or

more!), which raises some interesting issues. First, applying established algorithms to

Earth science data (even with a novel arrangement of component steps) generally does

not create a story that is compelling enough for publication in computer science venues.

Compounded by the difficulty of publishing in physical science venues, this issue

creates a real bind for researchers working in this area. Second, one wonders whether the

existing methods are sufficient for the problems at hand (meaning there is no need to rely

on more cutting-edge technology) or whether this fact simply reflects the speed at which new

algorithms trickle down from theory into practice. There is also a likely bias that favors

using algorithms that have been released in public toolboxes. For example, well-tuned

and tested versions of SVM, such as libsvm and SVM-Light, have certainly contributed

to the rapid growth of applications using this technique. Similarly, the C4.5

implementation of decision trees and the WEKA Toolkit, which contains a variety of

machine learning algorithms in Java, have made it easier for researchers to pull tools off-

the-shelf. Again, some caution must be applied. To paraphrase a comment by a science

colleague, “We must be careful to distinguish between science-driven tools and tool-

driven science.”

Future Themes

Given the current state-of-the-practice, can we identify current theoretical developments

or problem areas that are likely to have significant impact five years down the road?

Clearly, there are many regions in Table 5.1 that are empty or sparsely populated. These

cells indicate opportunities where data mining and statistical methods can potentially

create breakthroughs. Some of these we have already noted, but one of the more

promising areas for future work is the incorporation of physically-inspired process

models into data mining endeavors. In Table 5.1 we see that there were very few papers

at the workshop that attempt to incorporate domain knowledge directly into the process

model. Many of the scientists at the workshop lamented that the data mining and

statistical approaches may do a good job of modeling the data, but do not provide insight

into why they’re doing a good job or a connection back to underlying physical processes.

By combining a physically-inspired process model with an observational (data) model

and characterizing the uncertainties within these two models, we may better be able to

make inferences about the underlying physical processes. Section 2 expands upon this

idea, providing a more detailed conceptual model and discussion.
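
A minimal sketch of this idea, assuming the simplest possible case: a scalar quantity whose process-model prediction and observation both carry Gaussian uncertainty. The function name `fuse` and the numbers are hypothetical; the update itself is the standard Gaussian (Kalman-style) combination, not a method proposed in the workshop papers.

```python
def fuse(prior_mean, prior_var, obs, obs_var):
    """Combine a process-model prediction with an observation.

    prior_mean, prior_var: prediction of the physical process model and
        its uncertainty (variance).
    obs, obs_var: the measurement and the variance of the data model's
        observation error.
    Returns the posterior mean and variance of the underlying quantity.
    """
    # The gain weights the observation by the relative uncertainty
    # of the process model.
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

# Equal uncertainties: the posterior sits halfway between model and data,
# and its variance is half of either input variance.
post_mean, post_var = fuse(prior_mean=10.0, prior_var=4.0, obs=12.0, obs_var=4.0)
print(post_mean, post_var)
```

The posterior variance is always smaller than the process-model variance, which is the sense in which characterizing both uncertainties sharpens inference about the underlying process.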

The following are some additional areas where we see both technical challenges and

opportunities.

• Evaluation of algorithm performance in the presence of spatial correlation –

A number of approaches now use supervised or semi-supervised learning, in

which a training set is used to set key algorithm parameters (e.g., the set of

support vectors and weighting coefficients in an SVM classifier). When

evaluating the effectiveness of such approaches on separate data, researchers must

be wary of the effects of spatial correlation. Tobler’s famous First Law of

Geography states that “Everything is related to everything else, but near things are

more related than distant things.” In the context of evaluating algorithms, this means

that we must be mindful of the spatial relationship of the training data to the test

data. An experiment by Wagstaff et al. demonstrated this point clearly:

when they used leave-one-pixel-out training and testing, accuracy was 91%;

however, when they split the training and testing data spatially, accuracy dropped

to 82%.

• High cost of training labels – As machine learning methods penetrate further

into Earth science applications, there is a growing demand for labeled

training examples. It is quite time-consuming (and mind-numbing) for a person to

manually label large quantities of data. Several trends may offer some relief in

this area. First, semi-supervised techniques that use a small amount of labeled

training data and a large amount of unlabeled data to bootstrap a classifier, such

as in the paper by Vatsavai et al., offer one approach for leveraging a small

number of training labels. Active learning methods, which can focus a human

expert toward labeling the examples that would be most beneficial to a learning

algorithm, are also attractive for getting the most benefit from the science expert’s

time. Another interesting approach to this issue was suggested in the paper by

Dzeroski et al., who used simultaneous observations with a low-coverage high-

resolution instrument and a high-coverage, low-resolution instrument to create a

training set. The virtual sensor method of Srivastava might also be exploited for

this purpose. Yet another approach involved the use of a numerical simulator to

generate training examples. This idea was used by several authors (e.g.,

Ramachandran et al., Cai), but has potential for broader use.

• Contextual variables – Several papers showed that data distributions can be

highly dependent on the values of auxiliary contextual variables. For example, the

regression tree work of P. Kumar applied to the growth pattern of EVI (Enhanced

Vegetation Index) showed substantially different influences (meteorological,

topographic, soil) conditional on the month of the year. Methods to identify

relevant contextual variables and ascertain their influence (and causality) on

processes are needed. Methods for understanding and incorporating spatial

context are also expected to be important.

• Data Fusion – Although data fusion is already being widely used in analysis of

Earth science data, we expect this trend to expand and see more use of physical

process models (numerical simulators), in-situ sensors, ontologies, and cross-

focus-area applications.

• Onboard and In-Stream Algorithms – As algorithms become more mature and

reliable we expect to see a larger infusion of data mining and statistics into

mainstream science processing, becoming a standardized part of the data

distribution path. The paper by Mazzoni et al. noted that their SVM for cloud

classification has been incorporated into the Langley DAAC. Going a step further,

technology demonstration experiments such as the Autonomous Sciencecraft

Experiment have shown that pushing algorithms onboard spacecraft can lead to

novel science opportunities.

• Discovery and Change Detection – As we build up an extended record of

historical satellite observations, recognizing changes with

respect to the historical record will become even more important. Many current

approaches operate in a point vs. point mode, where a single observation is

compared against a single historical observation. Creating a richer model of the

historical data, possibly by incorporating physical models, may allow the

separation of expected change from change that is interesting or novel. Discovery

of patterns or trends that were unexpected or anomalous is an area with growth

potential, but one that is somewhat difficult to evaluate rigorously.

• Decision support systems – Scientists at the workshop voiced concerns that data

analysis results were not being linked back to the underlying physical processes.

A similar concern involves connecting analysis results back to policy decisions,

providing manageable and effective decision support systems that can fuse

multiple data sources, models, and analysis into actionable recommendations.

• Metric-driven Progress – One problem the field currently faces is that different

techniques are applied on different datasets using different parameters and

evaluation conditions. It is difficult to determine which algorithms are best suited

for a particular situation or whether a new algorithm is doing better than

established techniques. In many other fields, such as handwritten digit

recognition, human face detection, human face identification, object recognition,

and segmentation, the emergence of standard benchmark datasets has led to

significant progress, allowing the effectiveness of algorithms to be evaluated,

compared and improved.
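
The caution about spatial correlation raised in the first item above can be made concrete with a small synthetic experiment. The sketch below is not a reproduction of the Wagstaff et al. study: the scene, the regional offset, the jitter pattern, and the 1-nearest-neighbor classifier are all assumptions chosen to show how leave-one-pixel-out evaluation can look far better than a spatially separated split.

```python
def make_scene(width=20, height=5):
    """Synthetic pixels as (x, y, feature, label) tuples.

    The feature depends on the class label plus an offset that differs
    between the western and eastern halves of the scene (a stand-in for
    any spatially correlated nuisance factor).
    """
    pixels = []
    for x in range(width):
        for y in range(height):
            label = (x + y) % 2
            regional_offset = 0.6 if x >= width // 2 else 0.0
            jitter = 0.02 * ((7 * x + 3 * y) % 5 - 2)  # deterministic noise
            pixels.append((x, y, label + regional_offset + jitter, label))
    return pixels

def predict_1nn(train, feature):
    """Label of the training pixel whose feature value is closest."""
    return min(train, key=lambda p: abs(p[2] - feature))[3]

def accuracy(train_for, test):
    """Fraction of test pixels classified correctly.

    train_for(p) returns the training set to use when testing pixel p.
    """
    hits = sum(predict_1nn(train_for(p), p[2]) == p[3] for p in test)
    return hits / len(test)

scene = make_scene()

# Leave-one-pixel-out: every other pixel, including near neighbors, trains.
loo_acc = accuracy(lambda p: [q for q in scene if q is not p], scene)

# Spatial split: train on the western half, test on the eastern half.
west = [p for p in scene if p[0] < 10]
east = [p for p in scene if p[0] >= 10]
split_acc = accuracy(lambda p: west, east)

print(loo_acc, split_acc)
```

The gap here is exaggerated by construction. The point is only that training pixels drawn from the same neighborhood as the test pixels share local conditions the classifier will not see when applied to a new region.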

Generalizing the Role of Data Mining Methods in Earth Science Research

The roles the above technologies play in Earth Science research as a whole depend on the

particular method and application, but can be broadly categorized into two classes:

1. Data Characterization and Feature Detection

2. Causal Analysis and Anomaly Discovery

Data characterization and feature detection includes such technologies as classification

techniques, kriging and uncertainty analysis, and clustering and statistical

summarizations. Typically, the primary focus is on providing a more understandable

characterization (or view) of the underlying structure of large amounts of science data.

While they are not generally targeted toward directly extracting scientific results, these

methods can make massive data sets comprehensible and thus tractable to further

scientific analysis. An important aspect of these techniques is that the problem to be

solved is generally well-defined. Thus, the applicability of a given technique can be

determined, so that the risk is relatively low.

On the other hand, causal analysis and anomaly discovery are characterized by the

discovery of novel relationships among variables. Examples include the inference

of predictive models and the discovery of unexpected phenomena, such as the

search for novel climatic indices and related teleconnections. This role is less common in

the scientific data mining world. This is not surprising since the novelty aspect makes it

difficult to ensure beforehand (say, at grant application time) that a useful result can be

achieved. However, it is that very novelty that also makes for a potentially high reward.

As such it represents an important niche for the systematic application of certain data

mining techniques as an alternative to the (often) serendipitous nature of human

discovery. This does not cede the role of scientific discovery to machine learning

techniques, since such techniques typically do not provide a physical explanation or

model. Rather, machine learning techniques should proceed hand in hand with methods

more grounded in the natural sciences, the first identifying novel aspects for study, the

second fitting them into an understandable scientific model or framework.

Underexploited Categories of Problems

While it can be argued that data mining methods are underexploited throughout the Earth

sciences, there are two general categories of problems where such methods could make a

unique contribution. The first of these is problems where the physical mechanisms are

poorly known, or too complex to model. In these cases, inference of association rules or

predictive models could provide constraints and guidance toward zeroing in on the

dominant underlying physical mechanisms. Similarly, problems involving non-physical

relationships (e.g., socio-economic factors) can be investigated with machine learning

methods, though even here our eventual goal is an explanation of how and why these

relationships occur.
