2nd NASA Data Mining Workshop:
Issues and Applications in Earth Science
May 23rd – 24th, 2006
Pasadena, California
Final Report
Table of Contents
Executive Summary
  Key Findings
  Recommendations
1. Overview of the Workshop
  a. Objectives
  b. Attendance
  c. Agenda
2. Analysis of Results
  a. Current and Emerging Technology Themes
  b. Connection to NASA’s Earth Science Agenda
  c. Infusing Statistics and Data Mining into Earth Science Research
3. Recommendations
Appendix 1: Call for Papers
Appendix 2: List of Attendees
Appendix 3: Final Agenda
Appendix 4: Summary of Workshop Presentations
Appendix 5: Current and Emerging Technology Themes
Executive Summary
On May 23-24, 2006, NASA’s Earth Science Division sponsored the Second NASA
Data Mining Workshop: Issues and Applications in Earth Science Data, held in
conjunction with the Interface 2006 Symposium, at the Pasadena Westin Hotel. The
workshop, which was organized by a team from NASA and the University of Alabama in
Huntsville (UAH), was a successor to a previous workshop held in Huntsville in October
1999. The objectives of this second workshop were to again bring together researchers
from the Earth science, data mining, and statistics communities to see what results had
been achieved in the intervening 6 years, as well as to identify areas where data mining
and statistics could potentially yield significant scientific advances in Earth science.
The workshop consisted of an opening session of introductory talks from NASA and
Interface participants, five sessions of invited talks, and poster presentations. These were
selected from responses to a call for papers and organized by science topic. The
workshop concluded with a panel session on “How to promote the infusion of data
mining and statistical technologies into Earth science”. Sessions included discussion
time following presentations so participants could interact with the speakers and with
each other. A web page describing the workshop and containing all of the abstracts and
presentations can be found at: http://datamining.itsc.uah.edu/meeting06/index.html.
Key Findings
• The data mining and statistical methods presented at this workshop are considerably
more mature than they were 6 years ago. The presentations provided important
insights in a number of areas, with many of the techniques showing potential for
significantly advancing scientific understanding in various areas of Earth science.
• Data analysis approaches that have been historically employed in Earth science are no
longer adequate for dealing with the complexity, size, and novelty of NASA’s 21st
century data resources. New statistical methodologies and data mining algorithms that
address these issues need to be developed and infused into mainstream Earth science
research.
• The chief obstacles to infusion of modern data analysis methods in Earth science are:
1) the lack of publication venues and funding opportunities that promote innovative
data analysis in Earth science research, 2) the disconnect between “modeling the
data” and relating it back to underlying physical processes, and 3) the hesitancy of
Earth scientists to adopt new data analysis methods that have not been fully vetted
and accepted within their community.
• A conceptual framework is needed to articulate the roles that statistics and data
mining can play in advancing Earth science research. Such a framework should: 1)
link questions about Earth system processes to questions about data, and 2) provide
an infrastructure for making inferences from the data back to the underlying state of
the Earth’s system, and translating those inferences into physically meaningful
conclusions. Section 2 of this report outlines one possible framework, based on the
uses of data in NASA’s Earth science research program.
• Promoting collaboration between the existing Earth science, statistics, and data
mining communities is useful, but not sufficient, due to the intellectual and cultural
barriers. Establishing a new professional community, composed of researchers who
work in the intersection of Earth science, statistics, and data mining, may yield
greater impact over the long-term.
Recommendations
• NASA Program Managers in both the Earth science and technology development
areas should work together to meld modern data analysis research into mainstream
Earth science research by: 1) adding criteria to proposal opportunities that require or
reward development and/or use of modern data analysis methods in Earth science
research, and 2) establishing a funding mechanism specifically for the development of
new statistical and data mining methodologies that respond to data analysis problems
arising from the use of massive observational data sets to answer key Earth science
questions.
• NASA should take the lead in establishing a new professional community dedicated
to scientific discovery through the development and use of modern statistical and data
mining methods. This must go beyond collaboration to foster a new generation of
researchers with training in both Earth science and data analysis. Through its
education programs, NASA should encourage this type of interdisciplinary training at
both graduate and undergraduate levels. Today’s students are tomorrow’s members
of the new community.
• NASA should form a new working group to be made up of community leaders who
work in the intersection of Earth science and statistics/data mining. This Earth
Science Data Exploration and Analysis Working Group (ESDEAWG) would:
- identify areas where current data analysis practices could be improved, either
through development of new techniques or infusion of existing ones that are
hitherto unexploited (or underexploited);
- develop a Technology Readiness Level (TRL) ladder that is appropriate for
measuring progress and maturity of data analysis methodologies;
- provide recommendations to NASA on fostering the interdisciplinary
professional community described;
- establish a set of standard hybrid statistical-physical process models (see
Section 2 of this report) that can be used to calibrate research results based on
different methodologies against one another;
- identify or create a set of benchmark datasets the community can use to test
and compare different methodologies on the same data;
- formulate a framework that articulates the roles of statistics and data mining
as a means to advance Earth science in NASA’s approach to scientific
discovery;
- encourage established geoscience journals to devote special issues to new
methods for data analysis [1].
• NASA should hold workshops in Earth science, data mining, and statistics on a more
frequent basis. NASA should also sponsor a set of focused tutorials designed to train
Earth scientists in modern statistics and data mining and to train statisticians and data
miners in Earth science. These could be held in conjunction with existing professional
meetings such as AGU or AMS, or could be in the form of a Gordon Research
Conference (see http://www.grc.uri.edu/).
[1] In the biological sciences, there are dedicated journals, such as Nature Methods, that fulfill this need.
1. Overview of the Workshop
The Second NASA Data Mining Workshop: Issues and Applications in Earth
Science Data was held on May 23-24, 2006, in conjunction with the Interface 2006
Symposium at the Pasadena Westin Hotel. The workshop was organized by a team from
NASA and the University of Alabama in Huntsville (UAH), and was the successor to
the First NASA Data Mining Workshop held in Huntsville in October 1999. Members of
the organizing committee are listed below. These individuals conceived and produced
this workshop thanks to sponsorship by NASA’s Earth Science Division. The Program
Committee reviewed and selected the papers for presentation. The organizers, Program
Committee members, and additional contributors (all listed below) helped shape and run
the workshop and write this report.
Organizers
Amy Braverman NASA/Jet Propulsion Laboratory
Elaine Dobinson NASA/Jet Propulsion Laboratory
Sara J. Graves University of Alabama in Huntsville
Program Committee
Michael C. Burl NASA/Jet Propulsion Laboratory
Becky Castano NASA/Jet Propulsion Laboratory
Thomas Hinke NASA/Ames Research Center
Christopher S. Lynnes NASA/Goddard Space Flight Center
Bernard Minster University of California San Diego
Rahul Ramachandran University of Alabama in Huntsville
Additional Contributors
Jeanne Behnke NASA/Goddard Space Flight Center
Lynne Carver University of Alabama in Huntsville
Michael Garay NASA/Jet Propulsion Laboratory
Stephanie Granger NASA/Jet Propulsion Laboratory
Danny Hardin University of Alabama in Huntsville
Brian Wilson NASA/Jet Propulsion Laboratory
a. Objectives
Data from Earth-orbiting satellites have been accumulating at a very high rate for several
years now. In combination with in-situ observations and physics-based simulations, this
enormous, distributed repository holds answers to important questions about our planet’s
past, present and future. However, the information is only accessible if effective analysis
capabilities can be brought to bear. Data mining and statistics have the potential to
provide these capabilities, and, if employed in close coordination with Earth science
research, will significantly increase the science return from NASA’s vast Earth science
data collection.
The objectives of the Second NASA Data Mining Workshop were to 1) bring together
Earth scientists, statisticians, and data miners to match the needs of the scientific
community to existing capabilities provided by these data analysis experts, and 2) suggest
future research directions for data analysts to pursue to help advance Earth science
research. In particular, the workshop sought to facilitate formation of collaborative
relationships between Earth scientists and data analysts, and to identify specific problems
that these collaborations can address. To this end, NASA issued an open call for papers [2],
which is included in this report as Appendix 1.
b. Attendance
The workshop was attended by approximately 50 people, including NASA program and
project managers, presenters from the statistics and data mining communities, Earth
scientists, and members of the organizing and program committees. The workshop size
was consistent with the organizing and program committees’ goal of having a diverse, but
productive workshop. The full list of attendees and their affiliations can be found in
Appendix 2.
c. Agenda
The workshop agenda consisted of an opening session of introductory talks from NASA
and Interface participants followed by five sessions of invited talks and a poster session.
The Workshop concluded with a panel session on “How to promote the infusion of data
mining and statistical technologies into Earth science”. The complete workshop agenda
is listed in Appendix 3.
Opening the first day, Dr. Francis Lindsay from NASA’s Earth Science Division
discussed NASA’s data mining and analysis objectives and their relation to the
workshop’s objectives. He discussed NASA funding opportunities relevant to data
mining and statistical analysis and highlighted the contrast between core and community
data system development at NASA. He charged the participants with helping to narrow
the gap between information technology and Earth science and suggesting strategies
for moving these techniques into pertinent Earth science communities and data
systems. Next, Dr. Mary Ann Esfandiari, Program Manager for EOSDIS, spoke about
NASA’s plans for evolving its data systems. She presented a data system vision for the
year 2015 that provides increased interoperability, and greater flexibility and support for
users. Concluding the opening session, Dr. Ed Wegman of George Mason University
gave a talk, entitled “Statistics, Data Mining, and Climate Change”, which showed the
danger of using improper data analyses to reach scientific conclusions.
The remaining sessions consisted of oral and poster presentations from the collection of
papers submitted in response to the call. A total of 37 papers were submitted to the
workshop. Of these, 16 were selected for oral presentations, and 15 were chosen for
posters. The posters were presented the first evening during an informal reception. The
oral presentations covered a wide spectrum of scientific data mining and statistical
applications relevant to NASA’s Earth science mission. They also covered a variety of
different techniques and approaches. Appendix 4 contains a complete summary and
discussion of the workshop presentations.
[2] The call was posted in standard advertising venues for the computer science/data mining communities
(e.g., KDNuggets, ACM Calendar of Events) and for the Earth science community (AGU Meeting and
EOS). In addition, the call was sent directly to over 300 individuals with interest in scientific data mining.
The following sections of this report are devoted to more thorough analyses of aspects of
the workshop program, and a set of recommendations resulting therefrom.
2. Analysis of Results
In this analysis we examine the intellectual content of the workshop in order to formulate
a coherent picture of the state of the practice, and to suggest avenues for advancing
contributions of statistics and data mining to NASA’s Earth science research objectives.
a. Current and Emerging Technology Themes
As shown in Appendices 3 and 4, the workshop sessions were organized around science
discipline, in alignment with NASA’s Earth Science program. However, in the following
discussion and more completely in Appendix 5, we regroup (or reinterpret) the workshop
content according to technology theme, which leads to valuable insight about the current
state-of-the-practice and suggests possible directions for future work. During the course
of the workshop, it became clear that: 1) within a given science discipline, a broad variety
of data mining techniques could be applied, and 2) across different science disciplines,
the same data mining technique might be used [3]. Table 5.1 in Appendix 5 categorizes the
workshop papers by both science focus area and technology theme. This categorization
illustrates several points pertaining to the current state-of-the-practice and future themes.
Current State-of-the-Practice
A significant number of papers involve infrastructure activities and end-user tools that are
not tied to any specific science focus area. Infrastructure activities include, for example,
data management methods for organizing, storing, querying, and transferring large
volumes of data, creating and exploiting ontologies and metadata, and methods for
parallel and distributed data mining. End-user tools allow for retrieval, visualization, and
other interactive operations with large datasets. Although the initial reaction may be that
technologies should be more tightly coupled to science focus areas, it must be
acknowledged that there is indeed much in common across focus areas (especially at this
low level of data manipulation and processing); hence, factoring the commonalities into
infrastructure that can support multiple focus areas is reasonable. Researchers pursuing
such efforts are cautioned, however, that connecting their efforts back to the science
datasets and to the needs of the scientists themselves is critical to ensure useful products
emerge. A tool without a user or use is likely a wasted effort.
A second observation is that the dominant application of data mining and statistics
to Earth science data is the use of supervised learning techniques for land cover
classification (under the Carbon & Ecosystems focus area), and such applications are now
becoming fairly mature. In fact, very similar methods have been applied to cloud
classification and incorporated into the Langley DAAC for MISR processing.
[3] As a concrete example of the latter, support vector machines (SVMs) were used both for land cover
classification and atmospheric cloud classification.
Another popular application of data mining and statistics is the use of clustering
techniques for spatio-temporal pattern identification, usually at a global or regional scale,
with most of the activity in the Climate and Carbon/Ecosystems focus areas. This activity
likely reflects the broad interest in climate change and its impact on and interactions with
ecosystems, as well as the suitability of EOS data for these types of studies.
A more complete analysis of these and other observations is given in Appendix 5, along
with the table.
Future Themes
Given the current state-of-the-practice, can we identify current theoretical developments
or problem areas that are likely to have significant impact five years down the road?
Clearly, there are many areas in Table 5.1 that are empty or sparsely populated. These
cells indicate opportunities where data mining and statistical methods can potentially
create breakthroughs. Some of these we have already noted, but one of the more
promising areas for future work is the incorporation of physically-inspired process
models into data mining endeavors. In Table 5.1 we see that there were very few papers
at the workshop that attempted to incorporate domain knowledge directly into the process
model. Many of the scientists at the workshop lamented that the data mining and
statistical approaches may do a good job of modeling the data, but do not provide insight
into why they’re doing a good job or a connection back to underlying physical processes.
By combining a physically-inspired process model with an observational (data) model
and characterizing the uncertainties within these two models, we may better be able to
make inferences about the underlying physical processes. Section 2b below expands upon
this idea, providing a more detailed conceptual model and discussion.
Generalizing the Role of Data Mining Methods in Earth Science Research
The roles these technologies play in Earth science research as a whole depend on the
particular method and application, but can be broadly categorized into two classes: 1)
Data Characterization and Feature Detection, and 2) Causal Analysis and Anomaly
Discovery. Data characterization and feature detection includes such technologies as
classification techniques, kriging and uncertainty analysis, and clustering and statistical
summarizations. Typically, the primary focus is on providing a more understandable
characterization (or view) of the underlying structure of large amounts of science data.
While they are not generally targeted toward directly extracting scientific results, these
methods can make massive data sets comprehensible and thus tractable to further
scientific analysis. An important aspect of these techniques is that the problem to be
solved is generally well-defined. Thus, the applicability of a given technique can be
determined, so that the risk is relatively low.
On the other hand, causal analysis and anomaly discovery are characterized by the
discovery of novel relationships among variables. Some examples include the inference
of predictive models and the discovery of unexpected phenomena. An example is the
search for novel climatic indices and related teleconnections. This role is less common in
the scientific data mining world. This is not surprising since the novelty aspect makes it
difficult to ensure beforehand (say, at grant application time) that a useful result can be
achieved. However, it is that very novelty that also makes for a potentially high reward.
As such, it represents an important niche for the systematic application of certain data
mining techniques as an alternative to the (often) serendipitous nature of human
discovery. This does not cede the role of scientific discovery to data mining, since such
techniques typically do not provide a physical explanation or model. Rather, data
mining should proceed hand in hand with methods more grounded in the natural
sciences, the first identifying novel aspects for study, the second fitting them into an
understandable scientific model or framework.
b. Connection to NASA’s Earth Science Agenda
The workshop discussions repeatedly stressed that success in achieving relevance in
NASA’s mainstream scientific endeavors depends upon the central and continuous
involvement of the Earth science community, and this success is unlikely without a more
cohesive, focused data analysis community dedicated to solving Earth science problems.
Themes that resonated and recurred throughout the workshop included the relationship
between data analysis (a term used here to encompass both statistics and data mining) and
science understanding, the need for infusion beyond simple collaboration, and the need
for community. The remainder of this section is devoted to discussing the problems of
using data to advance scientific knowledge, and to discussing the issues of collaboration
and community in detail.
As noted previously, many interesting and successful projects were presented at this
Second Workshop. The data mining and statistics communities have clearly made
significant headway toward developing methods to address NASA’s Earth science
technology needs since the First Workshop in 1999. However, still more can be done to
infuse these methods into mainstream Earth science, and to develop new methods in
response to new problems. To accomplish this, the role of data in NASA’s approach to
Earth science needs to be examined, and the uses of data mining and statistics need to
be focused more clearly on turning those data into science understanding.
Understanding Earth Science Data
A primary focus of this workshop was to determine how data mining and statistical
analysis could advance scientific understanding of the Earth’s system. NASA’s Earth
Observing System (EOS) datasets provide massive quantities of observational data and
are an enormous resource for scientists working to understand physical processes that
make up that system. However, both the volume and complexity of the data sets are
impediments to their full exploitation. The workshop presented many methods and
approaches to address these issues, but as we listened to the varied problems and
solutions it was difficult to understand the relationships among them. A fundamental
conclusion is that a conceptual model or framework is needed to organize these
methods, one that is rigorous enough to provide structure and guidance, but flexible
enough to accommodate the wide variety of techniques and issues.
How, then, can we construct such a flexible but rigorous framework within which to link
broad Earth science questions to statistical analysis and data mining methods that may
help answer them? To bridge this gap two things are required. First, science questions,
which are questions about the Earth’s systems’ processes, must be translated into
questions about data. Second, answers to data questions must be linked back to
underlying physical processes. If these two requirements can be made quantitative, they
will provide the required structure.
To examine the role and uses of data to solve key Earth science questions, we use the
NASA “Climate Variability and Change” science focus area as an example of a NASA-
defined science objective. We present a very general conceptual model that provides a
mechanism for inference, and helps organize the problems, techniques and applications
presented at the workshop in order to better target our community’s efforts to further
NASA’s science goals.
NASA’s Approach to Science
The Climate Variability and Change Roadmap (Figure 1) poses the following questions:
• How is the global ocean circulation varying on interannual, decadal, and longer
time scales?
• What changes are occurring in the mass of the Earth's ice cover?
• How can climate variations induce changes in the global ocean circulation?
• How is global sea level affected by natural variability and human-induced change
in the Earth system?
• How can predictions of climate variability and change be improved?
According to the Roadmap’s “Where we plan to be” box, NASA seeks to characterize
and reduce the uncertainties in long-term climate prediction and provide routine
probabilistic forecasts of key climate variables. This speaks to the last question above:
how can predictions of climate variability and change be improved? The other four
questions appear to be aimed at better understanding oceanic and cryospheric component
processes that both influence and are influenced by the general climate system. Other
roadmaps identify similar types of questions related to other Earth system components.
This organization provides a general picture of how NASA’s scientific community
approaches its work to answer key questions: through a continuum of research and
modeling efforts to understand physical processes ranging from very specific, highly
constrained local and regional studies, to coupled global climate models which link
process models together and simulate feedbacks. The community has achieved an
impressive compromise between decentralized and centralized efforts, the latter led by
major modeling centers such as the National Center for Atmospheric Research (NCAR)
and the Geophysical Fluid Dynamics Laboratory (GFDL).
Figure 1. Climate Variability and Change Roadmap.
The Role of Data
Data contribute to this process in several ways. First, exploratory data analysis (EDA)
elucidates and/or discovers variables and relationships that help advance understanding of
physical processes. To the extent that physical models represent the community’s best
understanding of these processes, data contribute to model improvement. We call this
role “hypothesis formulation and discovery”. Second, data make hypothesis testing
possible. Formal hypothesis tests are part of confirmatory data analyses (CDA, i.e.,
inference) that assess the magnitudes of observed phenomena relative to their
uncertainties. Only when the former exceed the latter by a large enough margin do we
have a basis for concluding that phenomena are statistically significant. Otherwise,
observed phenomena might be artifacts of sampling. We call this role “hypothesis testing
and model diagnosis”. Here it is especially important to distinguish between hypotheses
about data and hypotheses about underlying physical processes. Data in hand provide a
particular, but not necessarily representative, view of underlying physical mechanisms.
Third, in coupled climate models, or even in complex physical process models, statistical
descriptions of data provide “parameterizations” of components that need to be
represented, but which are insufficiently understood to model deterministically. Here
again it is crucial that these statistical descriptors respect the underlying process and not
just the observed data.
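As a rough illustration of the second role, the sketch below compares an observed difference between two sets of measurements with its uncertainty. It is a minimal sketch, not a method presented at the workshop: the function name `z_statistic`, the sample values, and the 1.96 threshold (the conventional two-sided 5% level for a Gaussian statistic) are assumptions made for illustration. The calculation also treats the observations as independent, which spatial or temporal Earth science data often are not.

```python
import math
import statistics


def z_statistic(sample_a, sample_b):
    """Difference in sample means divided by its standard error."""
    diff = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    # Standard error of the difference, assuming independent observations.
    std_err = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    return diff / std_err


# Hypothetical measurements from two periods (illustrative values only).
period_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
period_2 = [2.0, 3.0, 4.0, 5.0, 6.0]

z = z_statistic(period_1, period_2)
is_significant = abs(z) > 1.96  # conventional two-sided 5% threshold
```

In this toy case the difference in means is the same size as its standard error, so it would not be judged significant and could be an artifact of sampling. For samples this small, a threshold based on the t distribution would be the more careful choice.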
These three roles of data aid directly in achieving better understanding of the climate
system, and therefore will help improve long-term climate prediction by deterministic
models. A fourth role is data assimilation, which provides real-time optimal estimates of
important quantities by combining deterministic model predictions and observations in a
Bayesian statistical framework. Unlike the previous three applications, in which data are
used to improve deterministic models that then make predictions, data assimilation
ingests model predictions after the predictions are made, and updates them in view of the
“evidence” provided by observations. Assimilation produces evolving, best estimates of
the true underlying state of the system. We do not include this fourth use of data in
formulating our conceptual view because the goal of assimilation is not to produce better
deterministic models, but to produce evolving best estimates of process-level climate
variables in real time.
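For a single quantity with Gaussian uncertainties, the update step of data assimilation reduces to a precision-weighted average of the model prediction and the observation. The sketch below shows that scalar case only. The function and parameter names are hypothetical, and operational assimilation systems work with high-dimensional states and covariance matrices rather than single numbers.

```python
def assimilate(forecast, forecast_var, observation, obs_var):
    """Update a scalar model forecast with one observation.

    Both uncertainties are assumed Gaussian and independent of each other.
    Returns the updated estimate and its variance.
    """
    # Weight given to the observation: large when the forecast is uncertain.
    gain = forecast_var / (forecast_var + obs_var)
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


# Hypothetical values: forecast and observation are equally uncertain.
estimate, estimate_var = assimilate(
    forecast=10.0, forecast_var=4.0, observation=12.0, obs_var=4.0
)
```

With equal variances the updated estimate falls halfway between the forecast and the observation, and its variance is half that of either input. The update itself leaves the deterministic model unchanged, which matches the distinction drawn above.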
A Framework for Data Analysis and Scientific Inference
In this section we offer a simple example of a conceptual model that relates the data
observed to the underlying physical mechanisms that generate them. The model has two
parts. The first part, called a statistical process model, states that the actual value of the
quantity of interest is the sum of a base value and local variation. The second part, called
the data model, states that the observed value of the quantity of interest is the sum of the
actual value and measurement error.
A statistical process model is a probability model that describes the behavior of quantities
with probability distributions. For example, “Actual total column water vapor in
Bakersfield, California in December is the sum of base value $\mu$ and a Gaussian local
perturbation with zero mean and variance $\tau^2$” is an example of a very simple statistical
process model. Unlike a deterministic model, which would describe total column water
vapor by physical equations, this statistical model views the data generating mechanism
probabilistically.
The data model describes the relationship between the actual value of the quantity of
interest and observations of it. For example, “Measured total column water vapor in
Bakersfield, California in December is subject to error, where that error follows a
Gaussian distribution with mean $\delta$ and variance $\sigma^2$” is a simple data model. Combining
the statistical process and data models provides a route to inference: to produce an
estimate of the base value, one uses the mean of the observations. To quantify uncertainty
in that estimate, one must account for both the measurement uncertainty ($\sigma^2$) and the
variation of the actual values ($\tau^2$).
This conceptual model is very general, and to be of practical use it must be tailored to the
specific setting of the problem in which it is to be used. For instance, in the Bakersfield
example, one may ask whether: a) the Gaussian distribution is the right distribution; b)
the values around Bakersfield are statistically independent of one another as is implied by
treating all local perturbations as if they arose from a single Gaussian; c) the spatial scale
attributed to the base value is appropriate to the physical phenomenon being studied; d)
the spatial correlations embodied in local perturbations about the base value are
appropriate; e) measurement error is independent of the base value; and, very importantly,
f) the base values themselves can be described with a physical model. If f) holds, the
statistical process model becomes a hybrid statistical-deterministic model and is a
mechanism for explicitly injecting scientific knowledge into the analysis.
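The two-part model in the Bakersfield example can be sketched numerically. The code below is a minimal sketch under stated assumptions, not the report's own implementation: the names `mu`, `tau`, and `sigma` (base value, standard deviation of the local perturbation, standard deviation of the measurement error) are hypothetical, observations are taken to be independent, and the measurement error is assumed to have zero mean. A nonzero error mean would shift the estimate by that amount. These are exactly the kinds of assumptions that questions a) through f) ask the analyst to check.

```python
import math
import random
import statistics


def simulate_observations(mu, tau, sigma, n, seed=0):
    """Draw n observations: base value + local perturbation + measurement error."""
    rng = random.Random(seed)
    observations = []
    for _ in range(n):
        actual = mu + rng.gauss(0.0, tau)          # statistical process model
        measured = actual + rng.gauss(0.0, sigma)  # data model (zero-mean error assumed)
        observations.append(measured)
    return observations


def estimate_base_value(observations):
    """Return the sample mean and its estimated standard error."""
    n = len(observations)
    mean = statistics.fmean(observations)
    # Under independence, the sample variance estimates tau**2 + sigma**2.
    std_err = math.sqrt(statistics.variance(observations) / n)
    return mean, std_err


def theoretical_std_err(tau, sigma, n):
    """Standard error of the mean when both variance sources are known."""
    return math.sqrt((tau**2 + sigma**2) / n)


# Illustrative values only; they are not taken from any data set.
obs = simulate_observations(mu=20.0, tau=3.0, sigma=1.0, n=5000, seed=1)
estimate, std_err = estimate_base_value(obs)
```

The uncertainty of the estimate depends on the sum of the two variances, so ignoring either the measurement error or the local variation would understate it. If the observations were spatially correlated, as question d) anticipates, the simple division by `n` would no longer be valid.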
There will be a host of such questions specific to any given investigation, and the answers
need to be codified mathematically so that uncertainty and biases are properly handled.
Sometimes the goal will be to obtain estimates, along with their uncertainties, of
physically meaningful parameters, and at other times the goal may be to develop a good
hybrid statistical-deterministic model capable of making accurate predictions at places or
times where measurements are unavailable. Another goal may be to make optimal
predictions or estimates where multiple data sources provide conflicting information
about the same or related quantities. This is the data fusion problem. In all cases it is
essential to base conclusions on a model framework that is scientifically sound and
statistically rigorous. This two-part concept, made up of a statistical-deterministic
process model and a data model, provides a mathematical bridge between the data and
the underlying physical processes we seek to understand. We need to cross that bridge
in order to take our conclusions from the data we analyze to answers to the Earth science
questions.
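The data fusion problem mentioned above can be illustrated with a minimal sketch. Assuming each source provides an independent, unbiased estimate of the same quantity with a known error variance, one standard approach is inverse-variance weighting. It is an illustrative technique chosen here, not a method prescribed by the workshop, and the function name and numbers are hypothetical.

```python
def fuse_estimates(estimates, variances):
    """Combine independent, unbiased estimates by inverse-variance weighting."""
    # More precise sources (smaller variance) receive larger weights.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    fused = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    # The fused variance is never larger than the smallest input variance.
    fused_variance = 1.0 / total_weight
    return fused, fused_variance

# Two hypothetical instruments reporting the same quantity with different error variances.
fused, fused_var = fuse_estimates(estimates=[12.0, 15.0], variances=[1.0, 4.0])
print(f"fused = {fused:.2f}, variance = {fused_var:.2f}")
```

If the sources are biased or their errors are correlated, these weights are no longer optimal, which is one reason the report stresses an explicit data model for each source.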
The Role of Data Mining and Statistical Analysis
Earlier we described three distinct roles of data in NASA’s approach to science: 1)
hypothesis formulation and discovery, 2) hypothesis testing and model diagnosis, and 3)
parameterization. The role of data mining and statistical analysis is to provide
machinery for employing data in each of these roles.
With respect to hypothesis formulation and discovery, NASA’s remote sensing data sets
are often so large that, like other massive data sets, their volume makes working with
them difficult. In fact, a fundamental problem, which many works presented at this
workshop addressed, is characterizing the data and relationships therein. For example,
supervised learning methods seek to classify or estimate the value of an unknown
quantity in the main body of a data set based on relationships observed in a portion of the
data. In other words, some data points are complete in that they contain observations of
both a quantity, say y, to be predicted and variables x that may be useful for making the
prediction. The set of available (x,y) pairs is called the training set. In a large portion of
the data, the quantity to be predicted (y) is missing. One viewpoint on supervised learning
is that it characterizes the joint distribution of x and y using the training data, and then
extrapolates that relationship to the remainder of the dataset in order to fill in the missing
information. The objective is to discover systematic relationships that were not known
previously, and understand their implications. However, all the uncertainty associated
with this procedure arises from uncertainty surrounding the representativeness of the
training data relative to the full dataset and the inductive bias of the particular learning
algorithm used. The procedure does not include a formal attempt to infer corresponding
systematic relationships in the process that generated the data.
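The supervised learning setup described above can be sketched in a few lines of Python. In this sketch the quantity y is observed for only a portion of the records, a relationship is learned from those complete (x, y) pairs, and the fitted relationship fills in the missing values. The synthetic data, the choice of a straight-line fit, and all names are assumptions made for illustration only.

```python
import random

def fit_line(pairs):
    """Least-squares fit of y = intercept + slope * x to (x, y) training pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def fill_missing(records, intercept, slope):
    """Replace y=None with the fitted prediction; keep observed y unchanged."""
    return [(x, y if y is not None else intercept + slope * x) for x, y in records]

rng = random.Random(1)
# Synthetic data set: y is observed for only the first 40 of 200 records.
records = []
for i in range(200):
    x = rng.uniform(0.0, 10.0)
    y = 2.0 + 0.5 * x + rng.gauss(0.0, 0.3)
    records.append((x, y if i < 40 else None))

training_set = [(x, y) for x, y in records if y is not None]
intercept, slope = fit_line(training_set)
completed = fill_missing(records, intercept, slope)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

The filled-in values are only as trustworthy as the assumption that the training pairs are representative of the full data set. Here that holds by construction, because every record is drawn from the same distribution.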
This may be one reason why data mining techniques are not widely accepted in Earth
science: their methods are often directed only toward discovery, and not toward hypothesis
formulation or testing. Few data analysts follow through to formulate testable, physical
hypotheses. This means teaming with Earth scientists to understand why, in physical
terms, observed relationships occur. This is hypothesis formulation and brings home the
point that, for scientists, pure predictive capability is not enough: scientists want to
understand why. Mining applications that do not provide information relevant to
hypothesis formulation will not capture the attention of the Earth science community.
Once a physical hypothesis exists, the next step is to formulate a test of that hypothesis
relative to the underlying statistical or statistical-deterministic process model.
Constructing an appropriate process model that adequately captures physical
understanding and properly treats uncertainty requires both statistical and scientific
expertise. The hypothesis is a statement about the process model for which the
implications can be mathematically propagated through the data model. These
implications exhibit themselves as probabilistic statements about what we would expect
to observe in the data if the hypothesis were true. By comparing the actual data to that
which is expected, the test determines whether or not data and hypothesis are consistent
with one another, and attaches a probabilistic confidence statement to the conclusion. If
the process model, the data model, or the propagation is incorrect, the conclusions will
not be credible. Moreover, any statistical parameterizations used as placeholders for
missing physical specification must be relative to the underlying process, not just the
observed data.
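A minimal sketch of such a test, under stated assumptions, follows. The hypothesis is that the base value of the process equals a specified number. Its implication, propagated through a Gaussian process model and a Gaussian data model with known variances and zero-mean measurement error, is that the sample mean is Gaussian with a known standard error. The observations, variances, and function name are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def z_test_base_value(observations, mu0, process_var, error_var):
    """Two-sided z-test of H0: base value == mu0 under the two-part model."""
    n = len(observations)
    # Under H0 the sample mean is Gaussian with mean mu0 and this standard error,
    # which combines process variation and measurement error.
    standard_error = math.sqrt((process_var + error_var) / n)
    z = (fmean(observations) - mu0) / standard_error
    # Probability of a deviation at least this large if H0 were true.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

observations = [9.1, 11.4, 10.2, 8.7, 12.0, 10.9, 9.6, 10.5]  # hypothetical values
z, p_value = z_test_base_value(observations, mu0=10.0, process_var=1.0, error_var=1.0)
print(f"z = {z:.2f}, p-value = {p_value:.2f}")
```

If the variances were misspecified, or if the observations were spatially correlated rather than independent, the standard error and hence the p-value would be wrong. This is the sense in which an incorrect process or data model undermines the conclusion.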
This discussion led to an important conclusion: to be more effective in Earth science,
data miners and statisticians must work together with physical scientists to complete
the chain of analysis through to inference about the Earth’s system. This will be a very
problem-specific task. The exact forms of the data, process and physical models involved
will be unique, and must be melded together carefully if uncertainties are to be
represented properly.
c. Infusing Statistics and Data Mining into Earth Science Research
At the 1999 NASA Data Mining Workshop participants voiced a concern that without the
development of practical data mining methods a great deal of Earth science data would be
underexploited, fundamentally undermining efforts to increase understanding of the Earth
system. Certainly in some measure, this concern has been realized. The Earth science
community still appears to view statistical and data mining applications in Earth science
as solutions in search of problems. Involvement of Earth scientists, while greater than six
years ago, remains relatively low.
Teaming Earth scientists and data analysis professionals is fundamental to achieving
success in this modern era of massive Earth science datasets. These datasets are so large
and complex that traditional data analytic methods are unable to exploit them fully.
Moreover, important scientific questions these data can help answer may never even be
asked because the lenses provided by simple techniques traditionally used by Earth
scientists are not sufficiently discriminating. Therefore, it is essential that modern
statistical and data mining methods become part of mainstream Earth science.
Workshop attendees identified a number of cultural and technical factors contributing to
the isolation of Earth science and data analysis communities. The next sections of this
report examine those factors and suggest how they might be overcome.
Current Obstacles
Data mining and large-scale statistical applications for Earth science problems are, for the
most part, still in development. A fair proportion of results presented at the workshop
involved tests on small data sets as case studies. Only a few examples of mature
approaches were presented. Part of the reason for the slow development of data mining
and analysis methods since 1999 is the slow adoption and implementation of a data
access infrastructure able to handle and deliver the amount of data being acquired by
NASA’s Earth Science Enterprise. Even so, the current state of affairs can’t be blamed on
problems with data access alone. Rather, the chief obstacles identified by workshop
participants are a lack of clear scientific leadership, participation, and support from
the Earth science community, and the tendency of data miners and statisticians to want
to solve problems that are interesting to them rather than to Earth scientists.
Workshop participants identified a variety of reasons for this isolation. First, scientists as
a whole, and well-established senior scientists in particular, tend to be conservative when
it comes to adopting new approaches. To illustrate this point, Figure 2 shows the number
of publications listed in the INSPEC database containing the acronym “NDVI” –
Normalized Difference Vegetation Index, now a standard approach in remote sensing of
land surfaces. The first publication listed in the database appeared in 1985. The number
of publications remained well below ten for seven years, until 1992, when the number of
publications began to show a steady increase. Whether due to data access issues or
scientific skepticism, this shows how long it can take for a new idea to make its way into
mainstream science analysis.
A primary reason for slow adoption of new practices in Earth science is the prevailing
perception that, in order to obtain funding and have papers accepted for publication, it
is necessary to work with established techniques. New approaches are “risky” and are
consequently less likely to get funding. This presents a real challenge for the professional
data analysis community because domain scientists’ participation is crucial to establish
the validity of new data analytic methods in the first place. Even early-career domain
scientists, who tend to be somewhat more receptive to new ideas, are unlikely to devote
substantial amounts of time to data mining and statistical techniques that are outside their
realm of expertise, and which may eventually turn out to be ineffective.
Figure 2: NDVI Publications by Year from INSPEC Article Database (horizontal axis: year, 1985 to 2004; vertical axis: number of publications, 0 to 100)
Not only is the Earth science community generally risk-averse to new data analytic
methods, but also many scientists believe that they already know “enough” statistics and
are unlikely or unwilling to enlist the help of statistical experts when it comes to data
analysis problems. This is partly due to the limited statistical training that students in
Earth sciences receive as part of their graduate or undergraduate schooling. Although
basic statistical and data analytic techniques are introduced, new techniques do not
usually appear in the classroom, nor are students likely to acquaint themselves with the
statistical research literature because this is well outside their area of interest and
expertise. The result is that most statistical analysis in the Earth science
literature does not go beyond simple, off-the-shelf methods. More advanced statistical
approaches are rare, and generally limited to research led by statisticians in the absence
of domain-relevant inspiration.
A different cultural problem impedes the infusion of mainstream data mining into Earth
science. The perception of many domain scientists is that methods such as supervised
classification and machine learning are simply “magic black boxes” that give what appear
to be correct results given certain inputs. This, however, subverts the usual paradigm for
how science is done, as shown in Figure 3 below. In this figure, the “traditional” Earth
science approach is shown by the shaded arrows. Earth science progresses by taking
observations, applying knowledge of physical principles to these observations, using this
knowledge to develop models, and allowing the models to make predictions. The data
mining approach also begins with the observations, feeds these observations through the
“magic black box,” which produces predictions. In either paradigm, the predictions can
feed back to the observations for comparison. However, unknown to the domain
scientists, and often the data analysis experts themselves, embedded within the “magic
black box” are potentially important clues to physical reasons that explain why data
mining produces good predictions. Data miners need to provide this information to
scientists in order to illuminate the relationships between good predictions and the
physics that underlie them. Unfortunately, very little research within the data analysis
community is currently going into uncovering these “hidden models”.
Figure 3
Another point raised by workshop participants was that science is hypothesis driven, but
data mining is data-driven. From the perspective of domain scientists, it often appears
that data mining is nothing more than a “fishing expedition”. In this view, an algorithmic
approach is developed by data analysis experts and applied “blindly” to Earth science
data, without scientific guidance, and in hopes of discovering some relevant scientific
result. There is a sense that the data analysis community has too much of a “have tool,
will travel” attitude toward Earth science data analysis and lacks appropriate direction
and focus.
This reveals a major intellectual gap that must be filled: the failure of all parties to
understand the roles of data mining and statistics in Earth science research. The role of
data mining in hypothesis formulation and discovery is distinct from the role of statistical
hypothesis testing, model parameterization, and inference. There is an important role
for the so-called “fishing expedition” in understanding the content of a massive data set,
but only if it follows through to the hidden (physical) model, and formulation of testable
hypotheses. To be testable, hypotheses must be expressed in terms of the data, and related
to a model of the underlying data generating mechanism. Too often the data analysis
expert thinks his or her job is over after characterizing relationships in the data. In reality,
it has just begun. The next steps require Earth scientists to participate in finding the
hidden model and articulating a hypothesis, and statisticians to formulate and test that
hypothesis.
Bridging the Gap
The main obstacles discussed above are the risk-averse culture of Earth science, along
with the structural aspects of funding and publications mechanisms that reinforce it, and
the failure of professional data analysts to: a) provide methods that help explain
underlying physical processes rather than merely make predictions, and b) follow through
with hypothesis formulation, testing, and inference.
Overcoming these challenges requires a concerted effort to develop a new professional
community consisting of truly interdisciplinary researchers – people who are trained in
both Earth science, and statistics or data mining. That training need not be formal, at
least in the short term, but it does require a commitment to understanding the
fundamentals of statistical analysis, data mining, and Earth science.
The new community, which might be called Earth Science Analytics, would reward
development of new data analytic methods that respond to new scientific questions, data
types, and computational capabilities. High value would be placed on new science
discoveries made possible by new data analytic techniques. Funding mechanisms that
reward this would be crucial: NASA and NSF would have to respond by setting up
special programs. NSF already has the Collaborations in Mathematics and the
Geosciences program, but the new programs would have to go well beyond collaboration.
NASA funds Earth science and technology through different offices and perhaps this is
contributing to the isolation of the two communities. Workshop participants discussed the
pros and cons of starting a new journal for publication of their results, but concluded that
using existing journals, perhaps organizing special editions, would be more practical.
NASA is in a unique position to foster a professional community dedicated to bringing
modern statistical and data mining technologies into Earth system science. NASA
sponsors both Earth science and technology, and has a great interest in seeing return of
science understanding from its data. A key finding of this workshop is that NASA should
establish a new advisory group to create and promote the efforts of this community. Only
NASA can bridge this gap by organizing and motivating the constituent experts from
all areas to work together.
3. Recommendations
The following recommendations to NASA stem from the discussions above and
those at the workshop:
• NASA Program Managers in both the Earth science and technology development
areas should work together to meld modern data analysis research into mainstream
Earth science research by: 1) adding criteria to proposal opportunities that require or
reward development and/or use of modern data analysis methods in Earth science
research, and 2) establishing a funding mechanism specifically for the development of
new statistical and data mining methodologies that respond to data analysis problems
arising from the use of massive observational data sets to answer key Earth science
questions.
• NASA should take the lead in establishing a new professional community dedicated
to scientific discovery through the development and use of modern statistical and data
mining methods. This must go beyond collaboration to foster a new generation of
researchers cross-trained in both Earth science and data analysis. Through its
education programs, NASA should encourage both graduate and undergraduate
interdisciplinary training. Today’s students are tomorrow’s members of the new
community.
• NASA should form a new working group to be made up of community leaders who
work in the intersection of Earth science and statistics/data mining. This Earth
Science Data Exploration and Analysis Working Group (ESDEAWG) would:
- identify areas where current data analysis practices could be improved, either
through development of new techniques or infusion of existing ones that are
hitherto unexploited (or underexploited);
- develop a Technology Readiness Level (TRL) ladder that is appropriate for
measuring progress and maturity of data analysis methodologies;
- provide recommendations to NASA on fostering the interdisciplinary
professional community described;
- establish a set of standard hybrid statistical-physical process models (see
Section 2 of this report) that can be used to calibrate research results based on
different methodologies against one another;
- identify or create a set of benchmark data sets the community can use to test
and compare different methodologies on the same data;
- formulate a conceptual model that articulates the roles of statistics and data
mining as a means to advance Earth science in NASA’s approach to scientific
discovery;
- encourage established geoscience journals to devote special issues to new
methods for data analysis.
• NASA should hold workshops in Earth science data mining and statistics on a more
frequent basis. NASA should also sponsor a set of focused tutorials designed to cross-
train Earth scientists in modern statistics and data mining and to cross-train
statisticians and data miners in Earth science. These could be held in conjunction with
existing professional meetings such as AGU or AMS, or be in the form of a Gordon
Research Conference (see http://www.grc.uri.edu/).
Appendix 1: Call for Papers
Second NASA Data Mining Workshop:
Issues and Applications in Earth Science
May 23-24, 2006
Westin Hotel, Pasadena, CA
http://datamining.itsc.uah.edu/meeting06/index.html
Data from Earth-orbiting satellites have been accumulating at a very high rate for several
years now. In combination with in-situ observations and physical model output, this
enormous, distributed repository holds the answers to important questions about our
planet’s past, present and future. However, the information is accessible only if effective
analysis capabilities can be brought to bear. Data mining has the potential to provide
these capabilities, and, if employed in close coordination with Earth science research, can
increase the science return from NASA’s vast Earth science data collection.
Workshop objectives:
The objectives of this Second NASA Data Mining Workshop are to bring together Earth
scientists and data miners to match the needs of the scientific community to existing
capabilities provided by computer scientists and statisticians, and suggest future research
directions they may pursue to help advance Earth science research. In particular, we seek
to facilitate formation of collaborative relationships between Earth and data scientists,
and identify specific problems those collaborations can address. To those ends, we will:
1. Assess the progress that has been made in Earth science data mining and analysis
since the first NASA Data Mining Workshop held in 1999 (see
http://datamining.itsc.uah.edu/meeting/); and
2. Identify areas where data mining could potentially yield significant scientific
advances in Earth science in the near and medium term.
Call for papers:
In order to facilitate an effective exchange of ideas and meaningful discussions, the
number of participants will be limited to approximately 40 selected submissions. It is
important that all participants commit to being in residence for the full duration of the
workshop.
The workshop format will be a combination of oral presentations, posters, breakout
sessions, and open plenary discussions. The agenda will be organized around papers
submitted in response to the following breakdown of the two high-level workshop
objectives listed above.
1.1 Description of successful projects. We seek papers that describe the nature of
the data mining techniques used, and how they contributed to the scientific results of the
project. We are also interested in particular characteristics of the collaborative interaction
that contributed to the project’s success, or could have been improved.
1.2 New projects. Descriptions of projects that are just getting started, but have a
significant data mining component that the authors believe will further the scientific
objectives. Papers should describe the data mining techniques used, how the authors
anticipate these techniques will contribute to the project’s scientific goals, and how the
project is organized to facilitate interaction between Earth scientists and data miners.
2.1 Unsolved scientific problems. Descriptions of difficult scientific problems that
have not been successfully addressed, but which the authors feel could potentially be
addressed by data mining methods. Papers should describe techniques that have
previously been applied, and why they have not been adequate. Authors should also
provide some evidence or justification for appropriateness of data mining as a solution,
and discuss requirements or constraints that would apply.
2.2 New applications for proven data mining techniques. Descriptions of data
mining techniques that have been successfully used in areas outside of Earth science, and
which the authors believe would be useful to Earth scientists. Papers could also include
techniques that have been used for one area of Earth science research which the authors
believe would be applicable to other areas. Authors should address why these techniques
have not been previously applied in the proposed area, and what the impediments are to
their near-term application.
2.3 New data mining techniques. Descriptions of new data mining techniques
emerging from the data mining research community that may not have been previously
applied to any real problem, but which the authors believe should be considered for use
by the Earth science community. Papers in this group can include speculative ideas for
data mining techniques that may still be in the early development stage.
Papers should be typeset in a single-column format, in 12pt font, and for letter or A4
sized paper. Papers should not exceed 4 pages (not counting references), and should be
submitted in PDF format.
Please submit your paper to Elaine Dobinson, by email to [email protected],
with “NASA Data Mining Workshop Submission” in the subject line. Also indicate in
your email which topic area (1.1, 1.2, 2.1, 2.2, or 2.3) your paper addresses. Please note
that due to time constraints some papers will be selected for oral presentation while
others will be posters. All accepted submissions will be published as a NASA technical
report.
The submission deadline is 11:00 pm (PST), January 16, 2006. Notification of acceptance
will be no later than March 17, 2006.
Follow on opportunity:
This NASA Data Mining Workshop precedes Interface 2006, the 38th Symposium on the
Interface of Computing Science, Statistics and Applications, at the Westin, May 24-27,
2006. A joint reception is scheduled for Wednesday evening (May 24), and a special
Interface session will be devoted to the results of this NASA Data Mining Workshop.
NASA Data Mining Workshop participants are invited to attend Interface 2006 at the
Interface member’s registration rate. Please see:
http://www.galaxy.gmu.edu/Interface2006/i2006webpage.html
for more information.
Appendix 2: List of Attendees
The list of attendees can also be found at http://datamining.itsc.uah.edu/meeting06/attendees.html.
Name Institution Email
Faleh Alshameri George Mason University [email protected]
Jeanne Behnke Goddard Space Flight Center [email protected]
Shyam Boriah University of Minnesota [email protected]
Kirk Borne George Mason University [email protected]
Amy Braverman Jet Propulsion Laboratory [email protected]
Michael C. Burl Jet Propulsion Laboratory [email protected]
Yang Cai Carnegie Mellon University [email protected]
Doina Caragea Iowa State University [email protected]
Becky Castano Jet Propulsion Laboratory [email protected]
Mete Celik University of Minnesota [email protected]
Yi Chao Jet Propulsion Laboratory [email protected]
Noel Cressie Ohio State University [email protected]
Elaine Dobinson Jet Propulsion Laboratory [email protected]
Saso Dzeroski Jozef Stefan Institute, Ljubljana, Slovenia [email protected]
Mary Ann Esfandiari Goddard Space Flight Center [email protected]
Mark Friedl Boston University [email protected]
Ivan Galkin University of Massachusetts Lowell [email protected]
Michael Garay Jet Propulsion Laboratory [email protected]
Stephanie Granger Jet Propulsion Laboratory [email protected]
Robert Grossman University of Illinois Chicago [email protected]
Geoffrey M. Henebry South Dakota State University [email protected]
Tin Kam Ho Bell Labs [email protected]
Rie Honda Kochi University, Japan [email protected]
Jason Hyon Jet Propulsion Laboratory [email protected]
Mehrdad Jahangiri University of Southern California [email protected]
Praveen Kumar University of Illinois Urbana [email protected]
S. Mark Leidner Atmospheric and Environmental Research, Inc. [email protected]
Francis Lindsay NASA [email protected]
Christopher S. Lynnes Goddard Space Flight Center [email protected]
Hal Maring NASA [email protected]
Jeff Masek Goddard Space Flight Center [email protected]
Zoran Obradovic Temple University [email protected]
Rahul Ramachandran University of Alabama in Huntsville [email protected]
Rob Raskin Jet Propulsion Laboratory [email protected]
Joseph Roden Battelle [email protected]
Rob Sherwood Jet Propulsion Laboratory [email protected]
Tao Shi Ohio State University [email protected]
Ashok N. Srivastava Ames Research Center [email protected]
Ranga Raju Vatsavai University of Minnesota [email protected]
Jorge Vazquez Jet Propulsion Laboratory [email protected]
Kiri Wagstaff Jet Propulsion Laboratory [email protected]
Amy Walton NASA [email protected]
Ed Wegman George Mason University [email protected]
Lisa Wilcox Goddard Space Flight Center [email protected]
Brian Wilson Jet Propulsion Laboratory [email protected]
Robert Wolfe Goddard Space Flight Center [email protected]
Jin Soung Yoo University of Minnesota [email protected]
Appendix 3: Final Agenda
The final agenda can also be found at http://datamining.itsc.uah.edu/meeting06/agenda.html.
Tuesday, May 23rd
8:00 - 8:30 Registration & Continental Breakfast
8:30 - 10:00 Session 1: Opening Talks, Session Chair: Elaine Dobinson; Session
Recorder: Jeanne Behnke
Welcome, logistics, workshop format & objectives [Slides] - Elaine Dobinson
NASA Role in Data Mining and Desired Workshop Results [Slides] - Francis Lindsay,
NASA HQ
Plans for the Evolution of EOSDIS [Slides] - Mary Ann Esfandiari, EOSDIS
Statistics, Data Mining, and Climate Change - Ed Wegman, George Mason University
10:00 - 10:30 Break
10:30 - 12:00 Session 2: Atmosphere 1, Session Chair: Chris Lynnes; Session Recorder: Stephanie Granger
Satellite Data: Massive but Sparse [Slides]
Speaker: Noel Cressie; Authors: Noel Cressie and Tao Shi; Ohio State University
A Hybrid Object-based/Pixel-based Classification Approach to Detect Geophysical Phenomena [Slides]
Speaker: Rahul Ramachandran; Authors: Xiang Li, Rahul Ramachandran, Sara Graves, and Sunil Movva; University
of Alabama in Huntsville
Essentials for Modern Data Analysis Systems [Slides]
Speaker: Mehrdad Jahangiri; Authors: Mehrdad Jahangiri, and Cyrus Shahabi; University of Southern California
12:00 - 1:30 Lunch
1:30 - 3:00 Session 3: Climate, Session Chair: Becky Castano; Session Recorder: Chris Lynnes
Knowledge Discovery From Global Remote Sensing and Climate Data: Results from Supervised and Unsupervised
Data Mining [Slides]
Speaker: Mark Friedl; Authors: Mark Friedl and Carla Brodley; Boston University
Characterizing Variability and Multi-resolution Predictions of Virtual Sensors [Slides]
Speaker: Ashok N. Srivastava; Authors: Ashok N. Srivastava and Rama Nemani; NASA Ames Research Center
The Application of Clustering to Earth Science Data: Progress and
Challenges [Slides]
Speaker: Shyam Boriah; Authors: Michael Steinbach, Pang-Ning Tan, Shyam Boriah, Vipin Kumar, Steven Klooster,
and Christopher Potter; University of Minnesota
3:00 - 3:30 Break
3:30 - 5:00 Session 4: Surfaces, Session Chair: Amy Walton; Session Recorder: Rebecca Castano
Spatiotemporal Data Mining for Monitoring Ocean Objects [Slides and Demo]
Speaker: Yang Cai; Authors: Yang Cai, Karl Fu, Daniel Chung, Richard Stumpf, Timothy Wynne, and Mitchell
Tomlison; Carnegie Mellon University
Temporal Modeling and Missing Data Estimation for MODIS Vegetation Data [Slides]
Speaker: Rie Honda; Author: Rie Honda; Kochi University, Japan
Using Land Surface Phenology for Spatio-temporal Mining of Image Time Series: A Manifesto [Slides]
Speaker: Geoffrey M. Henebry; Authors: Geoffrey M. Henebry and Kirsten M. deBeurs; South Dakota State
University
Recent HARVIST Results: Classifying Crops from Remote Sensing Data [Slides]
Speaker: Kiri Wagstaff; Authors: Kiri Wagstaff and Dominic Mazzoni; Jet Propulsion Laboratory
5:30 - 7:00 Poster Session & Reception
Automated Metadata for Image Mining
Presenter: Faleh Alshameri; Authors: Faleh Alshameri, and Ed Wegman; George Mason University
Clustering Spatio-Temporal Patterns using Levelwise Search
Presenter: Raj Bhatnagar; Authors: Abhishek Sharma and Raj Bhatnagar; University of Cincinnati
Automated Wildfire Detection Through Artificial Neural Networks [Slides]
Presenter: Kirk Borne; Authors: Kirk Borne (GMU), Jerry Miller (NASA), Brian Thomas (UMD), Zhenping Huang
(UMD), and Yuechen Chi (GMU); George Mason University
Sensory Stream Data Mining on Chip
Presenter: Yang Cai; Authors: Yang Cai and Yong X. Hu; Carnegie Mellon University
Knowledge Discovery from Disparate Earth Data Sources [Slides]
Presenter: Doina Caragea; Authors: Doina Caragea and Vasant Honavar; Iowa State University
Parameter Estimation for the Spatial Autoregression Model: A Rigorous Approach
Presenter: Mete Celik; Authors: Mete Celik, Baris M. Kazar, Shashi Shekhar, and Daniel Boley; University of
Minnesota
Intelligent Archive Technologies for NASA/IMAGE Radio Plasma Imager Data [Slides]
Presenter: Ivan Galkin; Authors: I. Galkin, G. Khmyrov, A. Kozlov, B.W. Reinisch, R. Benson, and S. Fung; U Mass
Lowell
Predicting Forest Stand Height and Canopy Cover from LANDSAT and LIDAR Data Using Decision Trees
Presenter: Saso Dzeroski; Authors: Saso Dzeroski, Andrej Kobler, Valentin Gjorgjioski, and Pance Panov; Jozef
Stefan Institute, Ljubljana, Slovenia
Selection Technique for Thinning Satellite Data for Numerical Weather Prediction [Slides]
Presenter: Mark Leidner; Authors: Christian Alcala, Ross N. Hoffman, and S. Mark Leidner; Atmospheric and
Environmental Research, Inc.
Show and Tell: A Seamlessly Integrated Tool for Searching with Image Content and Text
Presenter: Olfa Nasraoui; Authors: Zhiyong Zhang, Carlos Rojas, Olfa Nasraoui, Hichem Frigui; University of
Louisville
Asynchronous Data Mining Tools at the GES-DISC
Presenter: Christopher S. Lynnes; Authors: Long B. Pham, Stephen W. Berrick, Christopher S. Lynnes and Eunice K.
Eng; NASA - GES DAAC, Goddard Space Flight Center
Miner: A Suit of Classifiers for Spatial, Temporal, Ancillary, and Remote Sensing Data Mining
Presenter: Ranga Raju Vatsavai; Authors: Ranga Raju Vatsavai and Shashi Shekhar; University of Minnesota
The LBA-ECO Metadata Warehouse and Its Implications for Data Mining Initiatives
Presenter: Lisa Wilcox; Authors: Lisa Wilcox, Amy L. Morrell, and Peter C. Griffith; NASA Goddard Space Flight
Center
Data Mining Via Smart Grid Workflow: The SciFlo Dataflow Execution Network
Presenter: Brian Wilson; Authors: Brian Wilson, Dominic Mazzoni, Gerald Manipon, and Benyang Tang; Jet
Propulsion Laboratory
A Framework for Mining Co-evolving Spatial Events
Presenter: Jin Soung Yoo; Authors: Jin Soung Yoo and Shashi Shekhar; University of Minnesota
Wednesday, May 24th
8:00 - 8:30 Continental Breakfast
8:30 - 10:00 Session 5: Land Cover, Session Chair: Mike Burl; Session Recorder: Rahul Ramachandran
Multiscale Analysis Of Data: Clusters, Outliers and Noise - Preliminary Results
Speaker: Robert Grossman; Authors: Chetan Gupta and Robert Grossman; University of Illinois Chicago
Unraveling the Dominant Influences on the Evolution of Land-Surface Variables using Data Mining [Slides]
Speaker: Praveen Kumar; Authors: Praveen Kumar, Peter Bajcsy, Amanda B. White, Vikas Mehra, David Tcheng,
David Clutter, Wei-Wen Feng, Pratyush Sinha, and Richard Robertson; University of Illinois Urbana
Adopting Semi-supervised Learning Algorithms for Mining Remote Sensing Imagery: Summary of Results and Open
Research Problems [Slides]
Speaker: Ranga Raju Vatsavai; Authors: Ranga Raju Vatsavai, Shashi Shekhar, and Thomas E. Burk; University of
Minnesota
10:00 - 10:30 Break
10:30 - 12:00 Session 6: Atmosphere 2, Session Chair: Rahul Ramachandran; Session Recorder: Brian Wilson
An Operational Pixel Classifier for the Multi-angle Imaging SpectroRadiometer (MISR) Using Support Vector
Machines [Slides]
Speaker: Michael Garay; Authors: Dominic Mazzoni, Michael Garay, and Roger Davies; Jet Propulsion Laboratory
Data Mining Support for Aerosol Retrieval and Analysis – Project Summary
Speaker: Zoran Obradovic; Authors: Zoran Obradovic, Bo Han, Qifang Xu, Yong Li, Amy Braverman, Zhanqing Li,
and Slobodan Vucetic; Temple University
Polar Cloud Detection using MISR and MODIS Data [Slides]
Speaker: Tao Shi; Authors: Tao Shi, Bin Yu, Eugene E. Clothiaux and Amy J. Braverman; Ohio State
12:00 - 1:30 Lunch
1:30 - 3:00 Session 7: Application of Data Mining and Statistics to Earth Science Research, Session Chair and Panel
Moderator: Amy Braverman; Session Recorders: Chris Lynnes, Stephanie Granger
NASA's Approach to Earth Science [Slides] - Hal Maring, Radiation Sciences Program, Science Mission Directorate,
NASA
Panel Discussion - How to promote the infusion of data mining and statistical technologies into Earth science
Panel Members: Sara Graves, Francis Lindsey, Jeffrey Masek, Hal Maring, Ed Wegman, and Amy Walton
3:00 Workshop Adjourns – Report Writing Begins (Program Committee, Panel Members, & Volunteers)
Appendix 4: Summary of Workshop Presentations
Submitted abstracts and workshop presentations are available on
http://datamining.itsc.uah.edu/meeting06/agenda.html.
There were a total of 37 papers submitted to the workshop with 16 selected for
presentations and 15 as posters. The papers selected for presentation covered a wide
spectrum of scientific data mining applications relevant to NASA’s Earth science
mission. The papers covered different mining techniques, problems and themes within
Earth science. Consequently, technical presentations for the workshop were organized
based on science themes: Atmosphere, Climate, Surfaces and Land Cover.
Session 2: Atmosphere 1
N. Cressie described the problem of sparseness in massive satellite datasets and presented
an optimal statistical method for smoothing these massive, sparse datasets. His work
with Tao Shi focused on creating a Multi-angle Imaging SpectroRadiometer (MISR) level
3 aerosol optical depth product using Fixed Rank Kriging (FRK). Traditional kriging
methods require inversion of covariance matrices and can be computationally expensive.
FRK was presented as an efficient smoothing method that considerably reduced the mean
squared prediction errors as compared to other spatial smoothing methods. His
presentation emphasized the importance of using appropriate statistical techniques (e.g.,
covariance-function estimation) during the data preparation step, before carrying out
optimal smoothing. The importance of handling uncertainty during the preparation and
analysis was highlighted.
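The computational point can be made concrete with a short worked identity. This is a sketch under the standard FRK formulation from the literature, which the summary itself does not spell out; the symbols below are assumptions introduced for illustration. Suppose the n x n data covariance is written as a rank-r term plus a diagonal term, where S is an n x r matrix of basis-function values (r much smaller than n), K is an r x r covariance matrix, and D is diagonal. The Sherman-Morrison-Woodbury identity then replaces the n x n inversion with an r x r one:

```latex
\Sigma = S K S^{\top} + D, \qquad
\Sigma^{-1} = D^{-1} - D^{-1} S \left( K^{-1} + S^{\top} D^{-1} S \right)^{-1} S^{\top} D^{-1}
```

Because D is diagonal, the only dense matrices inverted are r x r, so the cost grows roughly linearly in n (on the order of n r^2) rather than cubically.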
R. Ramachandran described a collaborative mining project focused on detecting
geophysical phenomena, specifically fronts, in numerical model output. The science goal
for this project was to create a climatology of targets that could be used for further
analysis, including model validation and verification. A hybrid methodology that
combined both pixel level and object level mining was presented. An unsupervised
clustering algorithm was used to perform soft classification to identify possible frontal
pixels. A hierarchical thresholding technique was then used in conjunction with a Bayes
classifier to detect different regions as fronts. The presentation highlighted the
importance of involving domain experts during the mining process. In particular,
the domain experts helped during the data preparation step and in creating the “truth”
data.
M. Jahangiri presented a data storage and retrieval system that provides scientists the
functionality of performing complex statistical queries. The system is specifically
designed to handle very large multidimensional datasets such as those produced by the
Atmospheric Infrared Sounder (AIRS) instrument. The querying capability of this
system is built on discrete wavelet transforms to provide approximate answers with
progressively increasing accuracy to the users. The basic idea is to provide fast queries
by converting the data to wavelet coefficients then querying regions and reconstructing
the results. These queries work on multiple resolutions, thus a user is able to summarize
at various levels of abstraction on the fly. One of the obstacles faced by the authors while
designing and developing the system was the sparseness of the information as compared
to the file size.
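The idea of approximate answers with progressively increasing accuracy can be sketched in a few lines. The example below is an illustration only and not the presenters' system: it assumes a one-dimensional array whose length is a power of two, uses an unnormalized Haar transform, and all function names are hypothetical. For clarity it reconstructs the signal before summing, whereas a real system would evaluate the query directly on the coefficients.

```python
# Illustrative sketch: progressive approximate range-sum queries using an
# unnormalized Haar wavelet transform. All names here are hypothetical.

def haar_decompose(values):
    """Return (overall_average, detail_levels), details ordered coarse to fine.

    Assumes len(values) is a power of two.
    """
    data = [float(v) for v in values]
    details = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        differences = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details.append(differences)
        data = averages
    details.reverse()  # coarsest detail level first
    return data[0], details


def reconstruct(average, details, levels):
    """Rebuild the signal using only the first `levels` detail levels."""
    data = [average]
    for differences in details[:levels]:
        # Each (average, difference) pair expands to (a + d, a - d).
        data = [v for a, d in zip(data, differences) for v in (a + d, a - d)]
    repeat = 2 ** len(details) // len(data)  # stretch back to full length
    return [a for a in data for _ in range(repeat)]


def approximate_range_sum(average, details, start, stop, levels):
    """Approximate sum(values[start:stop]) from the coarsest `levels` levels."""
    return sum(reconstruct(average, details, levels)[start:stop])


if __name__ == "__main__":
    values = [2, 4, 6, 8, 10, 12, 14, 16]
    average, details = haar_decompose(values)
    for levels in range(len(details) + 1):
        print(levels, approximate_range_sum(average, details, 1, 4, levels))
```

With all detail levels included the answer is exact; with fewer levels the query reads fewer coefficients and returns a coarser estimate.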
Session 3: Climate
M. Friedl presented results and lessons learned from mining large volume and high
dimensional EOS datasets. The mining techniques applied in his research applications
covered both supervised and unsupervised classifiers. A supervised classification
technique, specifically a decision tree, was used to create a global land cover model using
Moderate Resolution Imaging Spectroradiometer (MODIS) datasets. Unsupervised
methods such as Independent Component Analysis and Canonical Correlations were used
to identify spatio-temporal patterns of joint ecosystem-climate variability at global scales
using gridded climate and remote sensing data sets. Even though these were successful
applications, a number of lessons pertaining to scientific data mining were learned. The
data, not the technique, are of paramount importance in supervised learning.
The data dictate issues like unbalanced training, proper feature selection, normalization,
etc. The use of unsupervised learning techniques needs to be driven by a science
hypothesis; otherwise the whole analysis process is in danger of being deemed a “fishing
expedition.” The presenter emphasized the lack of mature data mining toolkits that are
capable of handling large data sets and the need to have collaborative teams including
both data analysis experts and domain scientists to prevent scientists from doing naïve
analysis and data analysis experts from doing naïve science.
A. Srivastava presented the concept of a Virtual Sensor. This concept is based on the
assumption that there are potentially nonlinear relationships between spectra, and that one
can reconstruct a signal given available redundant data. Data mining techniques can be
used to learn these nonlinear relations for the reconstruction. Once learned, these
relationships can be used to backcast and make multi-resolution predictions. Backcasting
involves estimating the value of unmeasured spectra given other measured spectral
components. Multi-resolution prediction entails estimating high spatial resolution spectra
based on relationships learned from low resolution measurements. Kernel methods were
used for learning and the results based on the initial experiments support the concept of a
Virtual Sensor.
Application of clustering to Earth science data was presented by S. Boriah. Clustering
techniques were applied to discover climate indices. The detection of climate indices is
important in order to find relationships with other climate phenomena such as El Niño. A
new density based clustering technique called Shared Nearest Neighbor (SNN) was
designed for this application. The SNN technique did a better job detecting the known
climate indices, compared to traditional approaches that used Singular Value
Decomposition. The presentation also delved into challenges of clustering Earth
science data. These challenges arise from the fact that Earth science phenomena evolve
in both space and time. Consequently, there is a need to develop new clustering
algorithms that can find dynamic clusters. These new algorithms need to handle the
inherent spatiotemporal information embedded in the Earth science datasets.
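The core notion behind shared-nearest-neighbor methods can be sketched as follows. This is a minimal illustration of the similarity step only, assuming the common definition in which two points are similar in proportion to the number of k-nearest neighbors they share, counted only when each point appears in the other's neighbor list. It is not the presenter's algorithm, which additionally derives densities and clusters from these similarities; the function names and toy points are hypothetical.

```python
# Illustrative sketch: shared-nearest-neighbor (SNN) similarity.
# Names and data are hypothetical; only the similarity step is shown.

def k_nearest_neighbors(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort the other points by squared Euclidean distance to point i.
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])))
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(points, k):
    """Similarity = number of shared neighbors, for mutual neighbors only."""
    neighbors = k_nearest_neighbors(points, k)
    n = len(points)
    similarity = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if j in neighbors[i] and i in neighbors[j]:
                shared = len(neighbors[i] & neighbors[j])
                similarity[i][j] = similarity[j][i] = shared
    return similarity


if __name__ == "__main__":
    # Two well-separated toy groups of four points each.
    points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
    for row in snn_similarity(points, k=3):
        print(row)
```

Points in the same group share neighbors and receive a positive similarity, while points in different groups receive zero, regardless of the absolute distances involved.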
Session 4: Surfaces
A NASA/ESTO funded project on spatiotemporal data mining for monitoring ocean
objects was presented by Y. Cai. The case study was based on detecting and predicting
algae blooms. The components required to achieve this objective include object
segmentation and tracking, and the use of mining techniques to make predictions. A
spatial density clustering technique was used in conjunction with a convex hull to
segment the interesting regions and ignore regions with missing data. A shape correlation
filter based on the Fast Fourier Transform was used to track the object in time. A neural
network was trained to make spatiotemporal predictions. The input to the neural network
included a set of historical data of the object and additional physical parameters such as
wind and temperature. The neural network thus predicted the spatiotemporal evolution of
the object. The use of cellular automata was also explored to create simulations.
Coupling the multi-physics models into the data mining models was one of the
challenges presented to the audience. The presentation emphasized the need to have a key
domain expert involved both in the data mining project and in developing data mining
techniques.
R. Honda presented a methodology for temporal modeling and missing data estimation
for MODIS vegetation data. The common obstacles for spatiotemporal mining from EOS
satellite data are sparseness, noise and missing data. The methodology presented used a
Maximum a Posteriori (MAP) approach to fit a temporal model. The MAP estimate was
then combined with spatial information using an ensemble of regression trees (Random
Forests). The results presented showed that the temporal model built using MAP did
better than the piecewise logistic function model and did not require any additional
preprocessing of the data, such as smoothing or segmentation.
An exposition of issues for spatio-temporal mining of image time series was presented by
G. Henebry. These issues centered around synoptic ecology, which focuses on the
interface between the land surface and the atmosphere and needs to
distinguish unusual changes from expected variation. Three questions
fundamental to this issue were presented. The first asked what the appropriate unit of analysis is. It was
posited that the scale depends on the question at hand but is certainly not the “pixel”
level. The second question asked what baseline should be used to make
comparisons. The presenter postulated that current statistical techniques for modeling
spatio-temporal data focus on the images. However, what is of scientific interest
is not the images themselves, but rather the processes or patterns they portray. The author
presented a simple analogy to explain this point. One could create a baseline for a movie
by sampling a set of frames, but a better baseline would be the one that captures the plot
of the movie. The third question centered on conducting change analysis. Five steps to
conduct such analysis were presented: detection, quantification, assessment, attribution
and consequences.
K. Wagstaff presented results from the Heterogeneous Agricultural Research Via
Interactive, Scalable Technology (HARVIST) project. The goal of the project is to
develop a standard toolkit to enable interactive analysis of relationships among multiple
science data products, and provide capabilities such as clustering, classification, and
statistical analysis for prediction. Classification and prediction of crop yield was the
specific science application used to demonstrate the tool. Ground truth was used to train
a Support Vector Machine (SVM) classifier for MISR satellite data. The ground truth
consisted of measurements made for ten crops at 99 different sites in California. The
presenter mentioned the interactive nature of the tool was appealing to the scientists and
that one-on-one, in person interaction was much more effective in establishing a
working relationship between the scientists and data miners, as compared to email,
phone calls or the exchange of papers.
Session 5: Land Cover
R. Grossman presented a multi-scale analysis of large, distributed data sets in resource-
constrained environments. Three challenges in resource constrained data mining were
presented and possible solutions were advocated by the means of case studies. Accessing
large volumes of data was the first challenge. A new application level network protocol
was put forth as a possible solution to exploit high bandwidth to promote distributed
mining. The second challenge focused on modeling heterogeneous data. An algorithm
based on hierarchical decomposition was presented. The algorithm allows processing at
different scales (levels of decomposition). The third challenge dealt with deployment
issues in decision support systems for real world uses. A traffic alert system that uses
Doppler radar data in real time was presented as an example.
Use of data mining techniques to discover dominant influences on the evolution of land
surface variables such as vegetation indices, temperature, and emissivity was presented
by P. Kumar. The dependence of these surface variables on invariant factors such as
topography, soil characteristics and land cover type was also explored. A regression tree
induction was used to model the dependence of greenness on different parameters using
MODIS data. The rules generated by the tree induction algorithm were then analyzed to
discover the dominant influences. The presentation described the data mining system
(GeoLearn) that combines an Image2Knowledge (I2K), Data2Knowledge (D2K), and an
ArcGIS engine. The challenge for such a system is handling the heterogeneity of
formats, scale, resolution, sampling, etc. of the Earth science datasets. The key theme
highlighted during this presentation was that a successful data mining application is a
cyclically coupled process between data, science drivers, and the mining technology.
R. R. Vatsavai presented the use of semi-supervised learning algorithms for mining
remote sensing data. Acquiring labels or truth data for supervised classification is a
costly, time consuming and labor intensive process. It can be error prone or subjective,
as in the case where different experts classify a region differently. Semi-supervised
classification attempts to learn using partially labeled training data. An algorithm for
semi-supervised classification and experimental results from the algorithm were
presented. One of the interesting suggestions made during this presentation was the
creation of benchmark datasets. There are well-known benchmark datasets archived at U.
of California, Irvine that are used by the mining community to develop new algorithms.
It was suggested that NASA create and provide a set of benchmark Earth science
datasets to the mining community to foster the development of new algorithms geared
towards geoscience data.
Session 6: Atmosphere 2
M. Garay presented the use of Support Vector Machines (SVM) with data from the
Multi-angle Imaging SpectroRadiometer (MISR) to classify clouds. Cloud detection and
screening are important steps towards achieving the MISR science objectives. However,
the current, physically derived, MISR cloud masks have limitations. Consequently, an
SVM was used to create a cloud mask, detect thin cirrus, and perform multiclass
classification. The truth data used to train the classifier was generated by two
independent experts. The SVM classifier produced good results and the code is being run
at the Langley DAAC as a part of the MISR standard processing. SVMs provide a number
of advantages compared to other classification techniques. They tend to balance accuracy
and the ability to generalize. However, the inability to explain the underlying physical
reasons for the SVM success limits their acceptance by the broader Earth science
community.
Results from a study that applied data mining techniques for Aerosol Optical Thickness
(AOT) retrieval were presented by Z. Obradovic. First, a neural network was used to
analyze the influence of different attributes on the AOT retrieval accuracy. Then, a
decision tree was employed to analyze the conditions where the MISR or the MODIS
AOT retrievals could be improved by using neural network predictions. Interesting rules
were discovered such as: in presence of clouds, the neural network predictions
outperform MISR AOT retrievals at 72% of locations; while in desert regions, under
certain conditions, neural network predictions can improve MODIS retrievals at 80% of
the locations. Finally, the MODIS and MISR were used to generate combined AOT
retrievals, with improved accuracy.
T. Shi presented a methodology for improving polar cloud detection by fusing MISR and
MODIS data. First, an experimental MISR polar cloud detection algorithm was
developed. This algorithm, Enhanced Linear Correlation Matching (ELCM), used three
MISR angular signatures to detect polar clouds. A consensus mask was then created by
fusing both the MISR experimental and the MODIS operational cloud masks. The
consensus mask was used to train a classifier for MISR data. A Quadratic Discriminant
Analysis classifier was used. This classifier models each class density as a multivariate
Gaussian distribution. The use of fused data for training gave better results. The
effective collaboration between scientists and statisticians was again emphasized as
crucial for the success of the project.
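The statement that the classifier models each class density as a multivariate Gaussian can be illustrated with a small sketch. The code below is a generic two-feature quadratic discriminant classifier written for illustration; it is not the authors' implementation, and the feature values, class labels, and function names are hypothetical. Each class gets its own mean and covariance, and a point is assigned to the class with the highest log prior plus log density.

```python
import math

# Illustrative sketch: two-feature quadratic discriminant analysis.
# Synthetic data and hypothetical names; not the authors' implementation.


def fit_gaussian(samples):
    """Estimate the mean and 2x2 covariance (sxx, sxy, syy) of 2-D samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def log_density(point, mean, cov):
    """Log of the bivariate Gaussian density at `point`."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean), using the 2x2 inverse.
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def fit_qda(training):
    """training maps each label to a list of 2-D samples."""
    total = sum(len(samples) for samples in training.values())
    model = {}
    for label, samples in training.items():
        mean, cov = fit_gaussian(samples)
        model[label] = (math.log(len(samples) / total), mean, cov)
    return model


def classify(model, point):
    """Return the label with the largest log prior + log class density."""
    return max(model, key=lambda lab: model[lab][0] + log_density(point, *model[lab][1:]))


if __name__ == "__main__":
    training = {
        "cloud": [(0.9, 0.8), (1.0, 1.1), (1.2, 0.9), (0.8, 1.0)],
        "clear": [(3.0, 3.2), (3.1, 2.9), (2.8, 3.0), (3.2, 3.1)],
    }
    model = fit_qda(training)
    print(classify(model, (1.0, 1.0)), classify(model, (3.0, 3.0)))
```

Because each class has its own covariance, the decision boundary between classes is quadratic rather than linear, which is what distinguishes this approach from linear discriminant analysis.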
Session 7: Application of Data Mining and Statistics to Earth Science (Panel Discussion)
Hal Maring, Program Scientist and Manager for the Radiation Science Program, gave a
science manager’s perspective on science research in NASA. He stressed that NASA
does science that is policy relevant and policy driven. The whole research effort is to
build a comprehensive Earth System model where the observations are used to improve
process models, and the process models are coupled together into decision support
systems that can be used by policy managers. There are challenges in utilizing the data
properly because of disparities in space, time, parameter, uncertainty, source, data type,
etc. There are other challenges in fundamental climate research, such as characterization
of uncertainty, setting climate observation requirements and priorities. The
characterization of uncertainty of the models is essential if these models are going to be
used to make predictions about the future.
Amy Walton described different funding possibilities in ESTO for projects that apply
data mining techniques to Earth science problems. ESTO seeks cutting edge, but
science-driven, technologies. The proposals have to have a clear connection between
innovative technology and science needs. The proposals are peer reviewed by both the
science and the technology community to provide a balance.
Jeff Masek summarized a scientist’s views on data mining after listening to all the
technical presentations. He argued that conventional science follows the path where
observations feed models that generate predictions. These predictions are compared with
observations for validation. The models used are built on physics. Data mining
circumvents the traditional path as the observations themselves are used to build the
model and generate predictions. Data mining can be critical for success in science
problems where the physics is not well understood or the models are too complex or the
relationships between parameters cannot be conceptualized by scientists. Some
possible applications in terrestrial ecology could include finding the relationship between
vegetation dynamics and climate; prediction of the future trajectories of land cover
changes such as urbanization, deforestation; and the intelligent assimilation of land plot
data (biodiversity, flux towers) with remote sensing data. His presentation stressed that
data miners should look for science problems where the physical models fail or disagree.
The need for a tighter collaboration between scientists, statisticians and data miners was
again emphasized. These collaborative teams should focus on the physical process model
and the interpretation of the mined discovery. Mining applications should also focus on
datasets that are high dimensional and high volume, such as the entire MODIS dataset,
NCAR climate records, etc.
Poster Session
The poster presentations were given during a reception following Session 4. There were
15 posters selected for the workshop. However, two were not presented due to last
minute illnesses.
There were five posters submitted on data mining tools. I. Galkin presented a poster on
intelligent archive technologies for NASA/IMAGE radio plasma imager data. C. Lynnes
presented data mining tools at the GES-DISC that allow batch mode processing on large
volumes of data. A suite of classifiers for remote sensing data was presented by R. R.
Vatsavai. SciFlo, a tool that allows users to compose and execute mining workflows on
the grid, was presented by B. Wilson. A tool for searching images by O. Nasraoui, based
on content rather than metadata, was chosen for presentation, but unfortunately the
speaker could not attend.
Three posters focused on different applications of data mining to science problems. M.
Leidner demonstrated the use of wavelets to thin satellite data for numerical weather
prediction. S. Dzeroski presented a study that used a decision tree classifier to predict
forest stand height and canopy cover using LANDSAT and LIDAR data. An application
of neural networks for wildfire detection was presented by K. Borne.
There were two posters that focused on metadata related issues. L. Wilcox presented the
LBA-ECO metadata warehouse and its implications for data mining. A method to
augment existing metadata using automated methods was presented by F. Alshameri.
In addition, there were posters that covered other interesting topics. An FPGA based
reconfigurable sensory stream data mining processor was presented by Y. Cai. A general
strategy that allows algorithms to learn from semantically heterogeneous data sources and
allow knowledge discovery was presented by D. Caragea. M. Celik presented a faster,
scalable version of a spatial auto regression model (NORTHSTAR) for spatial data
analysis. J. S. Yoo presented a mathematical framework for mining co-evolving spatial
events. An algorithm for mining spatio-temporal patterns was to have been presented by
R. Bhatnagar, but unfortunately the speaker was ill and unable to attend.
Appendix 5: Current and Emerging Technology Themes
In alignment with NASA’s Earth Science program, the workshop sessions were
organized around science discipline. However, as the workshop proceeded, it was clear
that: 1) within a given science discipline, a broad variety of data mining techniques could
be applied and 2) across different science disciplines, the same data mining technique
might be used (see footnote 4). Thus, reorganizing (or reinterpreting) the workshop content according to
technology theme is a potentially valuable exercise for gaining insight into the current
state-of-the-practice and predicting directions for future work.
Table 5.1 takes this exercise a step further and categorizes the workshop papers both by
science focus area and technology theme. The rows of the table correspond to technology
theme and the columns correspond to Earth science focus area as defined at
http://science.hq.nasa.gov.earth-sun/science/. An additional column, denoted by a “*”,
was added to capture papers that were not closely tied to a particular science focus area.
In some cases, papers were assigned both to a specific discipline and to the “*” category,
e.g., a paper that proposed a general method that could clearly be applied in multiple
focus areas, but was evaluated within the context of a specific focus area. Also, note that
the rows of the table are grouped into three main divisions: process model (red), data
model (green), and infrastructure/tools (blue).
Current State-of-the-Practice
From Table 5.1, several points are clear. First, a significant number of papers involve
infrastructure activities and end-user tools that are not tied to any specific science focus
area. Infrastructure activities include, for example, data management methods for
organizing, storing, querying, and transferring large volumes of data, creating and
exploiting ontologies and metadata, and methods for parallel and distributed data mining.
End-user tools allow for retrieval, visualization, and other interactive operations with
large datasets. Although the initial reaction may be that technologies should be more
tightly coupled to science focus areas, it must be acknowledged that there is indeed much
in common across focus areas (especially at this low-level of data manipulation and
processing); hence, factoring the commonalities into infrastructure that can support
multiple focus areas is reasonable. Researchers pursuing such efforts are cautioned,
however, that connecting their efforts back to the science datasets and to the needs of the
scientists themselves is critical to ensure useful products emerge. A tool without a user or
use is likely a wasted effort.
Another observation from Table 5.1 is that the most dominant application of data
mining and statistics to Earth science data involves the use of supervised learning
techniques for land cover classification (under the Carbon & Ecosystems column) with
such applications now becoming fairly mature. In fact, very similar methods have been
applied to cloud classification and incorporated into the Langley DAAC for routine
MISR data processing.
4 As a concrete example of the latter, support vector machines (SVMs) were used both for land cover
classification and atmospheric cloud classification.
Column headings (left to right): Atmospheric Composition; Carbon & Ecosystems; Climate; Surface & Interior (see footnote 5); Water; Weather; *
Each row lists its non-empty cells from left to right, separated by " | "; empty cells are not shown.
Physics-based process model: 14, 30 | 7
Probability-based process model: 1 | 1
Change, anomaly, novelty: 9, 27
Clustering: 4, 7, 10, 22, 26 | 2, 4, 6 | 6 | 2 | 12
Discrete Events: 17 | 7, 26, 13 | 2, 25 | 2
Focus of attention, search: 17 | 11 | 2 | 20
Fusion: 30, 31 | 7, 10, 13, 15, 18, 22 | 15 | 19 | 5, 20
Missing data: 1 | 4, 8, 18, 28 | 4 | 1, 5, 28
Multires., wavelets: 1, 3 | 7, 26 | 19 | 1
Object detection, segmentation: 7, 13
Prediction (classification & regression): 1, 14, 30, 31 | 4, 7, 8, 10, 13, 16, 18, 22, 28 | 2, 4, 29 | 29 | 2, 29 | 1, 5
Spatial interp., spatial context: 1, 30 | 16, 22, 28 | 25 | 1, 28
Spatio-temporal pattern ident.: 4, 9, 27 | 2, 4, 6 | 6 | 2 | 12
Tracking: 7 | 2, 6 | 6 | 2
Uncertainty: 1 | 1, 5
Infrastructure: 3, 14, 17 | 15, 23, 26 | 15 | 3, 14, 15, 21, 24
Tools for end-user: 3, 17 | 10, 13, 22, 27 | 20, 21
Table 5.1. Approximate classification of workshop papers by technology theme (row) and science
focus area (column). The column headings denote the six Earth science focus areas defined by NASA,
plus an additional column, designated “*”, which is intended to capture papers that were not closely
tied to a particular focus area. The row headings indicate technology themes or approaches. Note
that the rows of the table are grouped into three main divisions: process model (red), data model
(green), and infrastructure/tools (blue). The cell entries are references to particular papers (see key
in footnote 6 below).
5 The Surface and Interior Structure focus area emphasizes natural hazards such as earthquakes, landslides,
erosion, floods, and volcanic eruptions.
6 Author key for Table 5.1: (1) Cressie, (2) Ramachandran, (3) Jahangiri, (4) Friedl, (5) Srivastava, (6)
Boriah, (7) Cai/Fu, (8) Honda, (9) Henebry, (10) Wagstaff, (11) Alshameri, (12) Bhatnagar, (13) Borne,
(14) Cai, (15) Caragea, (16) Celik, (17) Galkin, (18) Dzeroski, (19) Hoffman, (20) Nasraoui, (21) Lynnes,
(22) Vatsavai/Shekhar/Burk, (23) Wilcox, (24) Wilson, (25) Yoo, (26) Grossman, (27) P. Kumar, (28)
Vatsavai/Shekhar, (29) Garay, (30) Obradovic, (31) T. Shi.
Another popular application for data mining and statistics involves the use of clustering
techniques for spatio-temporal pattern identification, usually at a global or regional scale
with most of the activity in the Climate and Carbon/Ecosystems focus areas. This activity
likely reflects the broad interest in climate change and its impact on and interactions with
ecosystems, as well as the suitability of EOS data for doing these types of studies.
In contrast, we note that certain science focus areas are under-represented. In particular,
the Surface (Hazards) and Interior Structure, Water, and Weather areas had three or fewer
papers each. It is not clear whether this reflects lack of suitable data or algorithms for
addressing the science problems in these areas, or is merely an artifact (e.g., of the
workshop advertising or agency funding priorities).
The growing importance of data fusion is also illustrated by the table. Nearly a dozen
papers involve some form of data fusion. Joint use of MISR and MODIS data was
particularly common, but other applications involved the fusion of some combination of
satellite, airborne, and in-situ sensors. The recent NASA ROSES AIST call, which
emphasized the development of Sensor Web technologies, is likely to drive further work
in data fusion.
If we drill down into the individual papers, we see that a wide variety of techniques are
currently being applied to Earth science datasets, including:
• Canonical Correlation Analysis
• Classifiers based on parametric class-conditional models (Bayes classifiers with
Gaussian models)
• Clustering (k-means, spatial density clustering, Shared Nearest-Neighbor)
• Correlation-based object tracking, particle filter tracking
• Decision and Regression Trees including Random Forests (ensembles of
Regression Trees)
• Expectation-Maximization (semi-supervised learning)
• Fixed-rank kriging (FRK) for spatial interpolation/prediction
• Kernel methods (Support Vector Machines (SVM), Support Vector Regression)
• Markov Random Fields to model spatial context
• Neural Networks (feedforward neural nets, RBF, SOM, ensembles of neural nets)
• Principal Components Analysis (PCA), Independent Components Analysis (ICA)
• Spatial Autoregression Models
• Wavelets
Most of these techniques have been around for some time now (typically ten years or
more!), which raises some interesting issues. First, applying established algorithms to
Earth science data (even with a novel arrangement of component steps) generally does
not create a story that is compelling enough for publication in computer science venues.
Compounded by the difficulty of publishing in physical science venues, this issue
creates a real bind for researchers working in this area. Second, one wonders whether the
existing methods are simply sufficient for the problems at hand (so that there is no need
for more cutting-edge technology), or whether this merely reflects the slow pace at which
new algorithms trickle down from theory into practice. There is also a likely bias that favors
using algorithms that have been released in public toolboxes. For example, well-tuned
and tested versions of SVM, such as libsvm and SVM-Light, have certainly contributed
to the rapid growth of applications using this technique. Similarly, the C4.5
implementation of decision trees and the WEKA Toolkit, which contains a variety of
machine learning algorithms in Java, have made it easier for researchers to pull tools off-
the-shelf. Again, some caution must be applied. To paraphrase a comment by a science
colleague, “We must be careful to distinguish between science-driven tools and tool-
driven science.”
Future Themes
Given the current state-of-the-practice, can we identify current theoretical developments
or problem areas that are likely to have significant impact five years down the road?
Clearly, there are many regions in Table 5.1 that are empty or sparsely populated. These
cells indicate opportunities where data mining and statistical methods can potentially
create breakthroughs. Some of these we have already noted, but one of the more
promising areas for future work is the incorporation of physically-inspired process
models into data mining endeavors. In Table 5.1 we see that there were very few papers
at the workshop that attempt to incorporate domain knowledge directly into the process
model. Many of the scientists at the workshop lamented that data mining and
statistical approaches may do a good job of modeling the data, but provide neither insight
into why they do a good job nor a connection back to the underlying physical processes.
By combining a physically-inspired process model with an observational (data) model
and characterizing the uncertainties within these two models, we may better be able to
make inferences about the underlying physical processes. Section 2 expands upon this
idea, providing a more detailed conceptual model and discussion.
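As a minimal sketch of this idea (not the conceptual model of Section 2), suppose that both the process-model prediction and a single observation of one scalar quantity are summarized by Gaussians. The function name, the scalar Gaussian assumption, and the numbers below are all hypothetical and chosen only for illustration. The estimate is weighted toward whichever source has the smaller uncertainty, and the combined uncertainty is smaller than either one alone.

```python
def combine_process_and_observation(process_mean, process_var, obs_value, obs_var):
    """Fuse a process-model prediction with one observation (scalar Gaussian case).

    Assumption for illustration: both uncertainties are Gaussian and independent.
    """
    if process_var <= 0 or obs_var <= 0:
        raise ValueError("variances must be positive")
    # Weight given to the observation: large when the process model is uncertain.
    gain = process_var / (process_var + obs_var)
    # Move the process-model prediction toward the observation by that weight.
    posterior_mean = process_mean + gain * (obs_value - process_mean)
    # The combined variance is never larger than the process-model variance.
    posterior_var = (1.0 - gain) * process_var
    return posterior_mean, posterior_var


if __name__ == "__main__":
    # Hypothetical numbers: the model predicts 10.0, the sensor reports 14.0,
    # and both have variance 4.0.
    mean, var = combine_process_and_observation(10.0, 4.0, 14.0, 4.0)
    print(mean, var)
```

With equal variances, as in the usage example, the estimate lands midway between the two sources (12.0) and the variance is halved (2.0). A realistic application would replace the scalar prior with the output of a physical simulator and would need to justify the Gaussian assumption.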
The following are some additional areas where we see both technical challenges and
opportunities.
• Evaluation of algorithm performance in the presence of spatial correlation –
A number of approaches now use supervised or semi-supervised learning, in
which a training set is used to set key algorithm parameters (e.g., the set of
support vectors and weighting coefficients in an SVM classifier). When
evaluating the effectiveness of such approaches on separate data, researchers must
be wary of the effects of spatial correlation. Tobler’s First Law of
Geography states that “Everything is related to everything else, but near things are
more related than distant things.” In the context of evaluating algorithms, this means
that we must be mindful of the spatial relationship of the training data to the test
data. An experiment by Wagstaff et al. demonstrated this point clearly:
when they used leave-one-pixel-out training and testing, accuracy was 91%;
however, when they split the training and testing data spatially, accuracy dropped
to 82%.
• High cost of training labels – As machine learning methods make greater
penetration into Earth science applications, there is a growing demand for labeled
training examples. It is quite time-consuming (and mind-numbing) for a person to
manually label large quantities of data. Several trends may offer some relief in
this area. First, semi-supervised techniques that use a small amount of labeled
training data and a large amount of unlabeled data to bootstrap a classifier, such
as in the paper by Vatsavai et al., offer one approach for leveraging a small
number of training labels. Active learning methods, which can focus a human
expert toward labeling the examples that would be most beneficial to a learning
algorithm, are also attractive for getting the most benefit from the science expert’s
time. Another interesting approach to this issue was suggested in the paper by
Dzeroski et al., who used simultaneous observations with a low-coverage high-
resolution instrument and a high-coverage, low-resolution instrument to create a
training set. The virtual sensor method of Srivastava might also be exploited for
this purpose. Yet another approach involved the use of a numerical simulator to
generate training examples. This idea was used by several authors (e.g.,
Ramachandran et al., Cai), but has potential for broader use.
• Contextual variables – Several papers showed that data distributions can be
highly dependent on the values of auxiliary contextual variables. For example, the
regression tree work of P. Kumar applied to the growth pattern of EVI (Enhanced
Vegetation Index) showed substantially different influences (meteorological,
topographic, soil) conditional on the month of the year. Methods to identify
relevant contextual variables and ascertain their influence (and causality) on
processes are needed. Methods for understanding and incorporating spatial
context are also expected to be important.
• Data Fusion – Although data fusion is already being widely used in analysis of
Earth science data, we expect this trend to expand and see more use of physical
process models (numerical simulators), in-situ sensors, ontologies, and cross-
focus-area applications.
• Onboard and In-Stream Algorithms – As algorithms become more mature and
reliable we expect to see a larger infusion of data mining and statistics into
mainstream science processing, becoming a standardized part of the data
distribution path. The paper by Mazzoni et al. noted that their SVM for cloud
classification has been incorporated into the Langley DAAC. Going a step further,
technology demonstration experiments such as the Autonomous Sciencecraft
Experiment have shown that pushing algorithms onboard spacecraft can lead to
novel science opportunities.
• Discovery and Change Detection – As we build up an extended record of
historical satellite observations, recognizing changes with respect to the
historical record will become even more important. Many current
approaches operate in a point vs. point mode, where a single observation is
compared against a single historical observation. Creating a richer model of the
historical data, possibly by incorporating physical models, may allow the
separation of expected change from change that is interesting or novel. Discovery
of unexpected or anomalous patterns and trends is an area with growth potential,
but one that is somewhat difficult to evaluate rigorously.
• Decision support systems – Scientists at the workshop voiced concerns that data
analysis results were not being linked back to the underlying physical processes.
A similar concern involves connecting analysis results back to policy decisions,
providing manageable and effective decision support systems that can fuse
multiple data sources, models, and analysis into actionable recommendations.
• Metric-driven Progress – One problem the field currently faces is that different
techniques are applied on different datasets using different parameters and
evaluation conditions. It is difficult to determine which algorithms are best suited
for a particular situation or whether a new algorithm is doing better than
established techniques. In many other fields, such as handwritten digit
recognition, human face detection, human face identification, object recognition,
and segmentation, the emergence of standard benchmark datasets has led to
significant progress, allowing the effectiveness of algorithms to be evaluated,
compared and improved.
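The caution raised above about spatial correlation between training and test data can be illustrated with a small sketch. It does not reproduce the experiment by Wagstaff et al.; the grid size, block size, and function names are hypothetical. The code contrasts a random pixel split with a spatial block split and measures how many test pixels directly touch a training pixel, which is a rough proxy for how much a classifier could benefit from near-duplicate neighbors.

```python
import random


def make_grid(n_rows, n_cols):
    """Return all (row, col) pixel coordinates of an image grid."""
    return [(r, c) for r in range(n_rows) for c in range(n_cols)]


def random_split(pixels, test_fraction, seed=0):
    """Assign pixels to train/test at random, ignoring their location."""
    rng = random.Random(seed)
    shuffled = pixels[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def spatial_block_split(pixels, block_size, test_blocks):
    """Hold out whole square blocks so test pixels are spatially separated."""
    train, test = [], []
    for r, c in pixels:
        block = (r // block_size, c // block_size)
        (test if block in test_blocks else train).append((r, c))
    return train, test


def fraction_adjacent_to_training(train, test):
    """Fraction of test pixels with at least one 8-connected training neighbor."""
    train_set = set(train)
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    touching = sum(
        any((r + dr, c + dc) in train_set for dr, dc in offsets) for r, c in test
    )
    return touching / len(test)


if __name__ == "__main__":
    pixels = make_grid(20, 20)
    train_r, test_r = random_split(pixels, 0.25)
    train_s, test_s = spatial_block_split(pixels, 10, {(1, 1)})
    print("random split :", fraction_adjacent_to_training(train_r, test_r))
    print("spatial split:", fraction_adjacent_to_training(train_s, test_s))
```

Under the random split, nearly every test pixel sits next to a training pixel, whereas under the block split only the pixels on the block boundary do. Accuracy measured on the random split may therefore overstate how well a method generalizes to a new region.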
Generalizing the Role of Data Mining Methods in Earth Science Research
The roles the above technologies play in Earth science research as a whole depend on the
particular method and application, but can be broadly categorized into two classes:
1. Data Characterization and Feature Detection
2. Causal Analysis and Anomaly Discovery
Data characterization and feature detection includes such technologies as classification
techniques, kriging and uncertainty analysis, and clustering and statistical
summarizations. Typically, the primary focus is on providing a more understandable
characterization (or view) of the underlying structure of large amounts of science data.
While they are not generally targeted toward directly extracting scientific results, these
methods can make massive data sets comprehensible and thus tractable to further
scientific analysis. An important aspect of these techniques is that the problem to be
solved is generally well-defined. Thus, the applicability of a given technique can be
determined in advance, and the risk is relatively low.
On the other hand, causal analysis and anomaly discovery are characterized by the
discovery of novel relationships among variables. Examples include the inference
of predictive models and the discovery of unexpected phenomena, such as the
search for novel climatic indices and related teleconnections. This role is less common in
the scientific data mining world. This is not surprising since the novelty aspect makes it
difficult to ensure beforehand (say, at grant application time) that a useful result can be
achieved. However, it is that very novelty that also makes for a potentially high reward.
As such it represents an important niche for the systematic application of certain data
mining techniques as an alternative to the (often) serendipitous nature of human
discovery. This does not cede the role of scientific discovery to machine learning
techniques, since such techniques typically do not provide a physical explanation or
model. Rather, machine learning techniques should proceed hand in hand with methods
more grounded in the natural sciences, the first identifying novel aspects for study, the
second fitting them into an understandable scientific model or framework.
Underexploited Categories of Problems
While it can be argued that data mining methods are underexploited throughout the Earth
sciences, there are two general categories of problems where such methods could make a
unique contribution. The first of these is problems where the physical mechanisms are
poorly known, or too complex to model. In these cases, inference of association rules or
predictive models could provide constraints and guidance toward zeroing in on the
dominant underlying physical mechanisms. Similarly, problems involving non-physical
relationships (e.g., socio-economic factors) can be investigated with machine learning
methods, though even here our eventual goal is an explanation of how and why these
relationships occur.