
Environmental Applications of Data Mining

Sašo Džeroski

Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia

Abstract

Data mining, the central activity in the process of knowledge discovery in databases (KDD), is concerned with finding patterns in data. This paper introduces and illustrates the most common types of patterns considered by data mining approaches and gives rough outlines of the data mining algorithms that are most frequently used to look for such patterns. We also give an overview of KDD applications in the environmental sciences, complemented with a sample of case studies. The latter are described in slightly more detail and used to illustrate KDD-related issues that arise in environmental applications. The application domains addressed mostly concern ecological modelling.

Keywords: Data Mining; Knowledge Discovery; Decision Trees; Rule Induction; Environmental Applications; Ecological Modelling; Population Dynamics; Habitat Suitability


Fixed Rank Kriging for Massive Datasets

Noel Cressie

Department of Statistics

1958 Neil Avenue

The Ohio State University

Columbus OH 43210-1247

Abstract

Spatial modeling of massive data is challenging. The massiveness causes problems in computing optimal spatial predictors such as kriging, since its computational complexity is cubic in the size of the data. In addition, a large spatial domain is often associated with massive data, so that the spatial process of interest typically exhibits nonstationary behavior over that domain. In this paper, a flexible family of nonstationary covariance functions is constructed using a set of basis functions fixed in number. This approach, which we call Fixed Rank Kriging (FRK), results in computational simplification in deriving the best linear unbiased predictor (BLUP) and its mean squared prediction error for a hidden spatial process. A method is given to find the best estimator from this family of covariance functions, which is then used in the FRK equations. The new methodology is applied to a large dataset of remotely sensed Total Column Ozone (TCO) data, observed over the entire globe. This research is joint with Gardar Johannesson, Lawrence Livermore National Laboratory.

Keywords: best linear unbiased predictor, covariance function, Frobenius norm, geostatistics, mean squared prediction error, remote sensing, total column ozone
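The computational simplification described above can be made concrete with a small sketch. The following toy numpy example uses a low-rank ("fixed rank") covariance model and the Sherman-Morrison-Woodbury identity, which reduces the kriging solve from O(n^3) to O(n r^2); the Gaussian basis functions, parameter values and simulated data are invented for illustration, and this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setting: n observations, r fixed basis functions (r << n).
n, r = 2000, 25
s = np.sort(rng.uniform(0.0, 1.0, n))          # observation locations
centres = np.linspace(0.0, 1.0, r)             # basis-function centres (assumed)

def basis(x, centres, width=0.08):
    """Gaussian basis functions evaluated at locations x (illustrative choice)."""
    return np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)

S = basis(s, centres)                           # n x r matrix of basis functions

# Low-rank covariance model: Cov(Z) = S K S' + sigma2 * I,
# with K an r x r positive-definite matrix (here a made-up smooth choice).
K = np.exp(-np.abs(centres[:, None] - centres[None, :]) / 0.3)
sigma2 = 0.1

# Simulate data from the model (zero mean for simplicity).
eta = np.linalg.cholesky(K) @ rng.standard_normal(r)
z = S @ eta + np.sqrt(sigma2) * rng.standard_normal(n)

# Sherman-Morrison-Woodbury identity:
# (S K S' + sigma2 I)^{-1} = (I - S (sigma2 K^{-1} + S'S)^{-1} S') / sigma2,
# so the cost is O(n r^2) rather than the O(n^3) of a dense solve.
A = sigma2 * np.linalg.inv(K) + S.T @ S                         # r x r
Sigma_inv_z = (z - S @ np.linalg.solve(A, S.T @ z)) / sigma2    # Sigma^{-1} z

# Predict the hidden (smooth) process nu(s0) = s(s0)' eta at new locations:
# cov(Z, nu(s0)) = S K s(s0), so the predictor is s(s0)' K S' Sigma^{-1} z.
s0 = np.linspace(0.0, 1.0, 200)
pred = basis(s0, centres) @ (K @ (S.T @ Sigma_inv_z))
print(pred[:5])
```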


Inversion and Imaging for the Solid Earth

B. L. N. Kennett

Research School of Earth Sciences, Australian National University, ACT 0200, Australia.

Abstract

All knowledge of the interior of the Earth is based on indirect inference. Even apparently simple tasks such as the location of seismic events are actually highly non-linear inverse problems with data inputs of various types and quality. Many problems involve either data dependency on multiple classes of parameters or many different sources of data associated with the same description of an Earth model. The result is that there has been a strong independent tradition of innovation in geophysical inverse problems, since conventional tools do not directly translate to the problems at hand. A major problem is the description of the 3-dimensional interior structure of the Earth using observations of seismograms at the Earth's surface, which can rapidly lead to large numbers of parameters and data inputs. The dominant structure depends on radius, and so progress has been made by developing reference models for the average radial structure of the seismic wavespeed in the Earth and then seeking the 3-D variations about this state. I will illustrate the successes and problems associated with the generation of such reference models and the current state of imaging for 3-D structure.


Data explosion: The challenges for Geoscience Australia

P. McFadden

Geoscience Australia, GPO Box 378, Canberra, ACT 2601, Australia

Abstract

Current data holdings in Geoscience Australia exceed 700 terabytes and are growing apace. Geoscience Australia has a responsibility to meet the geoscience and geospatial information requirements of the nation, and so these vast and rapidly increasing data holdings present a significant set of challenges. Just the cost of storing such large amounts of data is a significant issue for agencies like Geoscience Australia that have a prime custodial role. There are then all the issues associated with ensuring that the information content within these data holdings is accessible and discoverable, and that information will not be destroyed as a consequence of choices made in how to store the data. For the scientist there is the critical question of how to structure data so that those data facilitate the answering of critical questions; if structured poorly, vast data sets can obscure the relevant information content and swamp an investigator with irrelevancies. Within Geoscience Australia, the different research groups and the Corporate Information Management and Access (CIMA) group are working to achieve the best responses to these challenges.

Keywords: ???


Data Mining Geoscientific Data Sets Using Self Organizing Maps

S.J. Fraser(1), B.L. Dickson(2)

(1) CSIRO Exploration & Mining, QCAT, PO Box 883, Kenmore 4069, Australia, [email protected] (2) Dickson Research Pty Ltd, 47 Amiens St, Gladesville, 2111, [email protected]

Abstract

Geoscientists are increasingly challenged by the joint interpretation of ever-expanding amounts of new and historic, spatially-located exploration data (e.g., geochemistry, geophysics, geology, mineralogy, elevation data, etc.). Because we can gather data faster than it can be interpreted, the availability of geographic information systems (GIS) has, to some extent, compounded rather than reduced this problem. Research into analysis and interpretation methods for data held in a GIS is in its infancy. A limited number of "advanced" interpretation methods have been developed; however, these often rely on a priori knowledge, training, or assumptions about mineralisation models. Objective, unsupervised methods for the spatial analysis of disparate data sets are needed.

We have investigated and developed a new computational “tool” to assist in the interpretation of spatially located mineral exploration data sets. Our procedures are based on the data-ordering and visualization capabilities of the Self Organizing Map (SOM), combined with interactive software to investigate and display the spatial context of the derived SOM “clusters”. These computational procedures have the capacity to improve the efficiency and effectiveness of geoscientists as they attempt to discover and understand the often subtle signals associated with specific geological processes (e.g., mineralization), and separate them from the effects of overprinting noise caused by other processes such as metamorphism or weathering.

Based on the principles of “ordered vector quantization”, the SOM approach has the advantage that all input data samples are represented as vectors in a data-space defined by the number of observations (variables) for each sample. The SOM procedure is an exploratory data analysis technique whereby patterns and relationships within a database are internally derived (unsupervised) based on measures of vector similarity (e.g., Euclidean distance and the dot product). The outputs of a SOM analysis are highly visual, which assists the analyst in understanding the data’s internal relationships.

Keywords: Self Organizing Maps, Data Mining, Geosciences
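As a rough illustration of the ordered-vector-quantization idea described above (best-matching units found by Euclidean distance, with a neighbourhood update on a 2-D map grid), here is a minimal numpy sketch of SOM training. It is a toy, not the software developed by the authors, and the grid size, decay schedules and synthetic data are arbitrary choices.

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, seed=0):
    """Toy Self Organizing Map: X is (n_samples, n_vars); returns codebook
    vectors arranged on a 2-D grid, shape (grid[0]*grid[1], n_vars)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    rows, cols = grid
    # grid coordinates of each map unit, used by the neighbourhood function
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    W = X[rng.integers(0, n, rows * cols)].astype(float)   # initialise from data

    for t in range(n_iter):
        x = X[rng.integers(0, n)]
        # best-matching unit by Euclidean distance (vector similarity)
        bmu = np.argmin(np.sum((W - x) ** 2, axis=1))
        # learning rate and neighbourhood radius shrink over time
        lr = 0.5 * (1.0 - t / n_iter) + 0.01
        radius = max(rows, cols) / 2.0 * np.exp(-t / (n_iter / 4.0)) + 0.5
        # Gaussian neighbourhood on the map grid: nearby units move too,
        # which is what gives the SOM its topology-preserving ordering
        g = np.exp(-np.sum((coords - coords[bmu]) ** 2, axis=1) / (2 * radius ** 2))
        W += lr * g[:, None] * (x - W)
    return W, coords

# usage: cluster samples by their best-matching unit
X = np.random.default_rng(1).standard_normal((500, 13))     # e.g. 13 assayed elements
W, coords = train_som(X)
labels = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2), axis=1)
```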

Introduction

Geoscientists in general, and explorationists in particular, commonly suffer data overload. Volumes of open-file reports, digital geological maps, geophysical data sets, and remotely sensed data are typically available. When these data are combined with the results of current exploration activities, serious data-overload problems can occur. Geographic Information Systems (GIS) and their capacity to store spatially located (digital) data have not necessarily assisted in the interpretation of data. More often than not, the incorporation of data into a GIS is seen as the 'goal', whereas in reality it is only the first step in the data analysis and interpretation procedure. GIS are important in that they allow spatially located exploration data to be stored in a database (ideally with checks as to the data's validity and integrity). However, the mechanisms for data interpretation in such systems have not kept pace with the enthusiasm with which data can be collected and stored. Traditional multivariate statistical approaches are often confused by data sets with variable relationships that are non-linear, by data distributions that are non-normal (typically with multiple populations), and by data sets that are disparate and sparsely filled (containing "nulls"), with both continuous and discontinuous numeric data and text. The SOM, ordered vector-quantization approach can overcome many of these problematic issues. A number of "advanced" interpretation methods have been developed for the GIS environment, such as "Weights of Evidence" (see Bonham-Carter and Agterberg, 1999), "Neural Networks" (see Brown et al., 2000) and other "Expert Systems". These "advanced" methods often rely on a priori knowledge, training, or a subjective approach (assumptions about mineralization models, and the probabilities as to the significance of particular occurrences or features), which may or may not exist, be relevant, or be available. There are very few techniques that enable a user to explore and analyze the relationships between the various data-layers stored in a GIS in an objective, quantitative fashion.


The authors have an ongoing interest in the development of tools and techniques to assist in the integrated analysis, interpretation and visualization of various exploration and mining related data sets, especially those with spatial or geographic attributes. For some time, they have been promoting the use of the Self Organizing Map (SOM; Kohonen, 2001) as a knowledge discovery or exploratory data analysis tool. The Self Organizing Map procedures are described in detail elsewhere (Kohonen, 2001). Briefly, however, if one represents all sample points as vectors in a data-space defined by the number of observations, the SOM procedure provides a non-parametric mapping (regression) that transforms an n-dimensional representation of these high-dimensional, nonlinearly-related data items to a typically two-dimensional representation, in a fashion that provides both an unsupervised clustering and a highly visual representation of the data's relationships. SOM procedures are used in a range of applications, but they are having a major impact in the fields of data exploration (Kaski, 1997) and data mining (Vesanto, 2000). The SOM procedures being developed by the authors are aimed at providing geoscientists with access to new methods for determining the intricate relationships within and between multiple, spatially-located and complex data sets. The analysis and visualization provided via SOM has the potential to be significant to both spatial and non-spatial investigations involving both resource discovery and utilization. Except for some specific applications, SOM procedures have not been widely accepted in the geosciences nor used by the exploration and mining industries. Because of the relatively recent development of the technique, there is much to learn about the potential and application of SOM for analyzing resource related data sets. SOM has been widely used for data analysis in the fields of finance, speech analysis and astronomy (see Kaski et al., 1998; Garcia-Berro et al., 2003), and more recently in petroleum well log and seismic interpretation (Strecker & Uden, 2002; Briqueu et al., 2002) and geochemistry and hyperspectral data (Penn, 2005). The SOM technique has characteristics and capabilities that make it ideal for geoscience applications, including:

• An ability to identify and define subtle relationships within and between diverse data, such as continuous (e.g., geophysical logs) and categorical (e.g., rock-type) variables;

• No required prior knowledge about the nature or number of clusters within the data (unsupervised);

• No assumptions about statistical distributions of variables or linear correlations between variables;

• Robust handling of missing and noisy data;

• Additional analysis tools, such as component analysis, spatial analysis and the ability to use a pre-computed SOM as a classification framework for a new dataset.

Results

Three examples using our SOM approach for the analysis of geoscience data sets will be presented. The first study used SOM to perform an analysis on some 40,000 located geochemical samples from drill-holes around a known copper-gold deposit. Each sample was assayed for up to 13 elements; however, 60% of the variable cells were nulls, a consequence of the data being collected over a 10-year period, with different element suites being used as the paragenetic model for the deposit evolved. Three main gold populations within the data set were highlighted using the SOM procedure. The first, we propose, relates to transported particulate gold within overlying Mesozoic sediments; the second to hydromorphically transported gold that is being moved into the overlying sediments; and the third relates to gold at the interface/unconformity between the overlying sediments and the basement lithologies. The SOM procedure was also able to highlight three spatial groupings of anomalous gold values. One was considered to be extensions to the known mine mineralization; the second related to a known prospect some four kilometers away from the mine; while the third occurs some 25 km away in a scout drill hole that was part of a regional grid, drilled during regional evaluation. The geochemical samples from this third region were not assayed for the same element suite as the holes around the known mine, but were based on an earlier, superseded model for mineralization in the area. The SOM procedure, however, was able to assign those samples to similar groupings as the samples around the mine, despite the fact that key elements were not assayed for in those samples. In the second study, over another Au-prospect, geochemical assay measurements were supplemented by a geologist's logged alteration descriptions. In this case the alteration descriptions were used as labels and not actually included in the SOM analysis. Two distinct high-Au associations were delimited by the SOM analysis of some ten elements for each sample. One Au-association was related to high Ag values; the other was related to only moderate Ag values. These two Au populations, when plotted spatially, form coherent spatial patterns. On a scatter plot of Au and Ag values coloured by their SOM-assigned groupings, a distinct trend could be observed that we believe indicates the "process of mineralization". This information can be used on spatial plots as a "vector-to-ore". When the alteration


labels were overlain onto points on the scatter plot, a general trend was evident from poorly mineralized propylitic samples through to highly mineralized samples exhibiting silica flooding, though not all samples logged as "silica-flooding" were highly mineralized. Our third study involves the use of SOM to assist in the analysis of hyperspectral reflectance data acquired by the HyLogger core-logging system on coal cores. The SOM procedure was applied to approximately 40,000 spectra, each with 522 channels of spectral values, to find natural "groupings" within these data that could be related to facies within the sediments and layering within the coal. In this case the SOM was used to simplify a very complex data set by "clumping" the data into meaningful packages that could be related to the geology by the domain analyst.

Discussion

In each study, the SOM procedure provided fundamental new knowledge, or assisted in simplifying the complexity present within these data, to assist in their analysis and interpretation. In both the first and second examples, the SOM visualizations alerted the analyst to evidence of geological processes present in the data, which assisted in their interpretation. In the third example, the SOM was used to simplify a complex data set into patterns that could be related to the coal sequence sedimentary packages. These capabilities are valuable contributions towards the analysis of geoscientific data sets, which are further enhanced by an ability to display the SOM outputs in their spatial context. The SOM procedure is an exploratory data analysis technique that derives the patterns and relationships within a data set in an unsupervised fashion, based on measures of vector similarity (e.g., Euclidean distance and the dot product). The outputs of a SOM analysis are highly visual, which assists the analyst in understanding the data's internal relationships and relating them to geological processes.

References

G.F. Bonham-Carter and F.P. Agterberg: Arc-WofE: a GIS tool for statistical integration of mineral exploration datasets. pp. 497-500, Proceedings International Statistical Institute, Helsinki, August 11-16, 1999.

L. Briqueu, S. Gottlib-Zeh, M. Ramadan, and J. Brulhet: Traitement des diagraphies à l'aide d'un réseau de neurones du type «carte auto-organisatrice»: application à l'étude lithologique de la couche silteuse de Marcoule (Gard, France). C.R. Geoscience 334 (2002) 31-337.

W.M. Brown, T.D. Gedeon, D.I. Groves, and R.G. Barnes: Artificial neural networks: a new method for mineral prospectivity mapping. Australian Journal of Earth Sciences 47, 757-770, 2000.

B.L. Dickson, D.A. Clark, and S.J. Fraser: New techniques for interpretation of aerial gamma-ray surveys. Final Report Project P491, CSIRO Exploration and Mining Report 653R, 20 pages (includes a CD-ROM), 1999.

E. Garcia-Berro, S. Santiago-Torres, and J. Isern: Using self-organizing maps to identify potential halo white dwarfs. Neural Networks 16 (2003) 405-410.

S. Kaski: Data exploration using self-organizing maps. Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series No. 82, Espoo 1997, 57 pp. Published by the Finnish Academy of Technology, 1997.

S. Kaski, J. Kangas, and T. Kohonen: Bibliography of Self-Organizing Map (SOM) Papers: 1981-1997. Neural Computing Surveys 1: 102-350. Available from http://www.icsi.berkeley.edu/~jagota/NCS/, 1998.

T. Kohonen: Self-Organizing Maps. Third Extended Edition, Springer Series in Information Sciences, Vol. 30, Springer, Berlin, Heidelberg, New York, 2001.

T. Kohonen: Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics, Vol. 43, pp. 59-69, 1982.

B.S. Penn: Using Self-Organizing Maps to visualize high-dimensional data. Computers and Geosciences 31, 531-544, 2005.

U. Strecker and R. Uden: Data mining of poststack seismic attribute volumes using Kohonen self-organizing maps. The Leading Edge, October 2002, pp. 1032-1037.


Dealing with unknown discontinuities in data and models

Kerry Gallagher (1) , John Stephenson(1), Chris Holmes(2)

(1) Dept. of Earth Sciences and Engineering, Imperial College London, London, England (2) Dept. of Statistics, University of Oxford, Oxford, England.

Abstract

The Earth is characterised by variability on many scales, and the significance of these scales depends on the particular problem under consideration. Natural spatial discontinuities, such as faults and lithological boundaries, separate regions within which the physical properties, processes or geological evolution may be similar. Similarly, time series may show very rapid (effectively discontinuous) changes, either in the actual signal or in the underlying process over time, as in proxy records for palaeoclimate and climate itself. In the most general problem, we may want to deal with both spatial and temporal discontinuities, or with situations where the relationship between spatial discontinuities may change with time. Another situation is where we may want to improve inference of a model or process by combining analyses of samples collected at different locations. However, it is generally not obvious how best to cluster these samples, given that they may contain an unknown (and so potentially different) record of the processes of interest.

We propose a solution to such problems in terms of inferring the locations (in space or time) of an unknown number of discontinuities in either the data or a model of the underlying process. This is known as partition modelling (or changepoint modelling in one dimension). It is conveniently posed in a Bayesian formulation and solved using reversible jump Markov chain Monte Carlo (RJMCMC), the transdimensional form of conventional MCMC. This approach allows efficient sampling of variable-dimensional model spaces, in which we only need to specify the maximum number of dimensions. The Bayesian approach has the advantage of parsimony, so that we avoid producing overly complex models while still achieving a satisfactory fit to the data. Furthermore, we can average models over variable or fixed dimensions, which provides a natural smoothing to the ensemble of discontinuous models, avoiding the need to specify smoothing functions.
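To make the changepoint (1-D partition) modelling idea concrete, here is a toy Python sketch of birth/death sampling over an unknown number of changepoints in a piecewise-constant signal. To keep it short and avoid the dimension-matching Jacobian of full RJMCMC, the segment means are integrated out analytically, so the moves reduce to ordinary Metropolis-Hastings ratios; the priors, noise level and data are invented, and this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy series with two true changepoints: piecewise-constant mean plus noise.
y = np.concatenate([rng.normal(0.0, 0.3, 120),
                    rng.normal(1.2, 0.3, 80),
                    rng.normal(0.4, 0.3, 100)])
n = len(y)
sigma2, tau2 = 0.3 ** 2, 1.0 ** 2   # noise variance (assumed known), prior variance of segment means
gamma = 0.05                        # prior penalty per changepoint (encourages parsimony)
positions = np.arange(1, n)         # a changepoint at i starts a new segment at index i
m = len(positions)

def log_marginal(cps):
    """Log marginal likelihood of y given a changepoint set, with each segment
    mean integrated out under a N(0, tau2) prior (closed form for Gaussian noise)."""
    edges = np.concatenate(([0], np.sort(np.asarray(cps, dtype=int)), [n]))
    total = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        seg = y[a:b]
        L, s, ss = len(seg), seg.sum(), (seg ** 2).sum()
        total += (-0.5 * L * np.log(2 * np.pi * sigma2)
                  - 0.5 * np.log(1.0 + L * tau2 / sigma2)
                  - ss / (2 * sigma2)
                  + tau2 * s ** 2 / (2 * sigma2 * (sigma2 + L * tau2)))
    return total

cps, log_ml = [], log_marginal([])
trace = []
for it in range(20000):
    k = len(cps)
    prop = None
    if rng.random() < 0.5 and k < m:        # birth: add a changepoint at a free position
        free = np.setdiff1d(positions, cps)
        prop = cps + [int(rng.choice(free))]
        log_ratio = np.log(gamma) + np.log(len(free) / (k + 1))   # prior x proposal ratio
    elif k > 0:                             # death: remove a randomly chosen changepoint
        drop = int(rng.choice(cps))
        prop = [c for c in cps if c != drop]
        log_ratio = -np.log(gamma) + np.log(k / (m - k + 1))
    if prop is not None:
        log_ml_prop = log_marginal(prop)
        if np.log(rng.random()) < log_ml_prop - log_ml + log_ratio:
            cps, log_ml = prop, log_ml_prop
    trace.append(len(cps))

print("posterior mean number of changepoints:", np.mean(trace[5000:]))
```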


Exploiting the data explosion using geographically local analyses

S. W. Laffan(1)

(1) School of Biological, Earth and Environmental Sciences, The University of New South Wales, Sydney, 2052, Australia.

Abstract

The explosion in spatial data allows for analyses of much finer spatial detail than was previously possible for applications as diverse as geochemistry, geophysics, crime, epidemiology and biodiversity. In many cases the spatial density of samples makes the application of geographically local analyses routine. Such analyses evaluate a model at each sample location, producing a surface of models and associated diagnostics. This allows far greater insight into the nature of the association between a set of variables than is normally provided by geographically global analyses. In particular, the nature of any correlations can be assessed as they change in different parts of a landscape. Any analysis method can be adapted to be geographically local, but one must still be mindful of the pitfalls of spatial data such as the curse of dimensionality and the fact that geographic data are normally correlated and therefore violate a basic assumption of many conventional statistical techniques. In this talk I will describe some recent developments using a geographically local implementation of the Sparse Grids analysis system to analyse a data set of 57,642 drill cores from the Weipa bauxite deposit in Queensland, Australia. Sparse grids are particularly suited to the analysis of geographic data. They do not assume the data are uncorrelated, they can fit flexible functional forms, and are less susceptible to the curse of dimensionality. The results will be compared with the more commonly used Geographically Weighted Regression (GWR) system, which implements a set of geographically local linear regression models.

Keywords: Spatial analysis; Geographically local analysis; Sparse Grids; Geographically Weighted Regression.
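For readers unfamiliar with geographically local analyses, the following is a minimal sketch of Geographically Weighted Regression: an ordinary weighted least-squares model is refitted at every sample location, with weights from a Gaussian kernel of a chosen bandwidth, so that the coefficients form a surface. It is illustrative only and is not the GWR or Sparse Grids software referred to in the abstract; the synthetic data and bandwidth are arbitrary.

```python
import numpy as np

def gwr(coords, X, y, bandwidth):
    """Geographically Weighted Regression: fit a weighted least-squares model
    at every sample location, weighting neighbours by a Gaussian kernel of the
    given bandwidth. Returns an array of local coefficient vectors."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])          # add intercept
    betas = np.empty((n, p + 1))
    for i in range(n):
        d2 = np.sum((coords - coords[i]) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # geographic kernel weights
        XtW = Xd.T * w                             # (p+1, n)
        betas[i] = np.linalg.solve(XtW @ Xd, XtW @ y)
    return betas

# usage with synthetic data: a coefficient that drifts across the landscape
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(400, 2))
x1 = rng.standard_normal(400)
y = (0.02 * coords[:, 0]) * x1 + rng.standard_normal(400) * 0.1
betas = gwr(coords, x1[:, None], y, bandwidth=15.0)
print(betas[:3])   # local intercept and slope at the first three locations
```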


Computational Frameworks enabling multi-scale multi-physics models

B. Appelbe(1) , S. Quenette(1)

(1) Victorian Partnership for Advanced Computing, P.O. Box 201, Carlton South, VIC 3053, Australia

Abstract

In computational models, there is an increasing need for coupling: ranging from coupling at the equation level, to tensor-level coupling and field coupling. Coupling can occur either on model boundaries or throughout the model. Traditionally, such coupling has been done on an ad-hoc basis, and coupling of traditional computational codes can be so difficult as to require almost complete rewriting of the codes to facilitate coupling. The computational codes we have developed for ACcESS and CIG, such as Snac and Snark, were developed with coupling in mind, and as we have evolved these codes, the support for coupling has gradually improved and become both simpler and more powerful. This paper will present our experiences in coupling models and equations within the StGermain Framework, and the lessons we have learned from implementing such coupling.

Keywords: Model Coupling; Multi-scale; Multi-physics; Frameworks

Introduction

Contemporary research funding has a significant focus on facilitating e-Research. This entails enabling geoscientists to access vast amounts of data located anywhere around the world. It entails enabling geophysicists to utilise vast amounts of computational cycles, also located anywhere around the world. An often overlooked, but equally relevant, facet is that e-Research is enabling modellers to construct increasingly complicated simulations through the guided application of software "best practices". All three facets are producers and consumers of masses of data in the Earth sciences. Our group is primarily focussed on the latter facet of e-Research, that is, facilitating the construction of higher fidelity, more sophisticated computational models of phenomena. In terms of geodynamics, this focus includes the Snark and SPModel projects of the ACcESS MNRF, a consortium focussed on providing Australia with a common resource for geodynamics modelling. We also partake in the e-Research facilitation of CIG, a recently established equivalent organisation within the United States, and of its precursor, GeoFramework, through the Snac project. The evolution of the three software projects, Snark, SPModel and Snac, is instructive. They all began with the scientifically modest goals of creating parallel, 3D versions of existing codes of well-established phenomena (mantle convection, erosion, and crustal deformation respectively). However, the natural evolution of models, facilitated by effectively applied software engineering, has pushed the focus to multi-phenomena problems, entailing multi-scale, multi-physics capabilities. This includes refining the fidelity of existing phenomena models, as well as encapsulating the effects of two or more phenomena within the one model. Some examples are:

1. Lithospheric to mantle models
2. Embedding high-resolution regional models within low-resolution global models (e.g., mantle plumes within a global mantle wind model)
3. Coupled advection/diffusion models (e.g., magma melt)
4. Surface process to Lithospheric models

Naively, one might imagine that coupling two existing codes together ("code-coupling") is a simple problem of feeding the output of one code as input into another. However, in general, coupling is far more difficult than this, and is complicated by differences in time and length scales and by assumptions in the constitutive models and numerics utilised. Rather, the choice of coupling regime for a multi-phenomena problem is a function of the constraints implied by each phenomenon model. Some models are sensitive to numerical error, some require accurate interface tracking, some are biased for execution speed, and so on. For example, one model may be best suited to, and hence implemented with, an explicit Lagrangian approach (e.g. a lithospheric code), but its coupling counterpart may be an implicit Eulerian implementation (e.g. a mantle code). In this case, neither code suits the counterpart's numerics, and field coupling across the two existing codes may be best. But this is not always the case.


Figure: Representative diagrams of the three styles of multi-scale, multi-physics coupling discussed.

There are lithospheric phenomena that also suit an implicit Eulerian implementation, and in this case the discretisation is not incompatible with the mantle convection's implementation. Consequently, equation coupling can be used; that is, both phenomena are modelled in the same code on the same domain. The issue then becomes implementing the different material physics and tuned numerical methods for both regions respectively. Another example is where one phenomenon is individually modelled at two different scales, for example where the constitutive update of a given region of interest at the larger scale is actually resolved by modelling that same region in a separate representative domain with physics and numerics relevant to that scale (fine model in coarse model). Software frameworks can help reduce the time, cost and effort in developing applications. StGermain is the foundation of a layered framework targeted at creating computational codes. This covers numerical, physics and computational sciences by providing people within these disciplines with a common framework, and hence a medium for collaborative development. This same infrastructure provides a means to facilitate the coupling regimes described above. The result is an evolution in the ability to model real world phenomena, bringing us closer to modelling problems of real world relevance. We will discuss our experiences in enabling multi-scale, multi-physics modelling through this framework.
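As a schematic of what field coupling between two existing codes can look like in practice, the sketch below exchanges boundary fields between a "lithosphere" and a "mantle" model at each shared time step. The interface and field names are hypothetical and deliberately ignore mesh interpolation and sub-stepping; this is not the StGermain, Snac or Snark API.

```python
from typing import Protocol
import numpy as np

class Model(Protocol):
    """Hypothetical minimal interface a coupled code might expose."""
    def step(self, dt: float) -> None: ...
    def boundary_field(self, name: str) -> np.ndarray: ...
    def set_boundary_field(self, name: str, values: np.ndarray) -> None: ...

def couple(lithosphere: Model, mantle: Model, dt: float, n_steps: int) -> None:
    """Field coupling on a shared boundary: the mantle supplies tractions to the
    lithosphere, the lithosphere returns velocities, then both advance one step."""
    for _ in range(n_steps):
        lithosphere.set_boundary_field("traction", mantle.boundary_field("traction"))
        mantle.set_boundary_field("velocity", lithosphere.boundary_field("velocity"))
        lithosphere.step(dt)
        mantle.step(dt)

class ToyModel:
    """Trivial stand-in code: relaxes its boundary fields toward unity each step."""
    def __init__(self):
        self.fields = {"traction": np.zeros(4), "velocity": np.zeros(4)}
    def step(self, dt: float) -> None:
        for name in self.fields:
            self.fields[name] += dt * (1.0 - self.fields[name])
    def boundary_field(self, name: str) -> np.ndarray:
        return self.fields[name].copy()
    def set_boundary_field(self, name: str, values: np.ndarray) -> None:
        self.fields[name] = values.copy()

couple(ToyModel(), ToyModel(), dt=0.1, n_steps=10)
```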

References

StGermain: https://csd.vpac.org/StGermain
Snac: http://geoframework.org/twiki/bin/view/Snac.WebHome
Snark: https://csd.vpac.org/Snark
Underworld: http://wasabi.maths.monash.edu.au/twiki/view/Codes/UnderWorld
Pyre: http://www.geodynamics.org:8080/cig/software/pyre

S.M. Quenette, B.F. Appelbe, M. Gurnis, L.J. Hodkinson, L. Moresi and P.D. Sunter: An investigation into design for performance and code maintainability in high performance computing. Proc. of 12th Computational Techniques and Applications Conference CTAC-2004, 46:C1001-C1016.

D. Sulsky, Z. Chen and H.L. Schreyer: A particle method for history-dependent materials. Comput. Methods Appl. Mech. Engrg. 1994, 118:179-196.

L. Moresi, F. Dufour and H.-B. Mühlhaus: A Lagrangian integration point finite element method for large deformation modeling of viscoelastic geomaterials. J. Comput. Phys. 2003, 184:476-497.

V. Kouznetsova, M.G.D. Geers and W.A.M. Brekelmans: Multi-scale constitutive modelling of heterogeneous materials with a gradient-enhanced computational homogenization scheme. International Journal for Numerical Methods in Engineering 2002, 54:1235-1260.

M. Gurnis, C. Hall and L. Lavier: Evolving force balance during incipient subduction. Gcubed 2004, 5.


Linking Observations to Subduction Process Modelling

M. Sdrolias(1), R.D. Müller(1), M. Gurnis(2)

(1)School of Geosciences and University of Sydney Institute of Marine Science (USIMS), Edgeworth David Building F05, University of Sydney, NSW, 2006, Australia

(2)Seismological Laboratory, California Institute of Technology, Pasadena, CA, 91125, USA

Abstract

Understanding the initiation and processes governing subduction remains one of the greatest challenges in geodynamics. Subduction affects every aspect of the earth system and it is generally agreed to be one of the primary driving forces of plate tectonics and mantle convection through slab pull and the addition of raw materials into the mantle. Previous attempts to numerically model the initiation and development of a self-sustaining subduction system have relied on instantaneous snapshots and theoretical boundary conditions not well constrained by geological and geophysical observations. However, subduction zones are extremely dynamic and have continuously changing shapes, locations, orientations and physical properties through time. While computer simulations have provided useful insights into some of these problems, the lack of well-integrated observational constraints has limited previous models to various 2D or 3D simplifications. We have created a subduction database comprising a detailed global study of various subduction zone parameters, including: the age of the subducting oceanic lithosphere; convergence rate and direction; back-arc spreading rates; the absolute motion of the overriding and downgoing plates; and the dip angle of the subducting slab. These observational constraints are used as boundary layer input into 3D spherical mantle convection models using CitComS to achieve more realistic models of subduction initiation and development. The results of our models will have implications for our understanding of the subduction factory, mantle convection and will have applications for the exploration of ore deposits in convergent margin settings.

Keywords: Subduction; Back-arc Basins; Mantle Convection


Lithosphere-Hydrosphere interactions: Stokes flow with a free surface

J. Braun(1), P. Fullsack(2), M. DeKool(3)

(1) Geosciences Rennes, Universite de Rennes 1, F-35042 Rennes cedex, France

(2) Department of Oceanography, Dalhousie University, Halifax, NS, B3H 4J1, Canada

(3) Research School of Earth Sciences, The Australian National University, Canberra, ACT 0200, Australia

Abstract

The coupling between surface processes (erosion, transport and sedimentation) and tectonics has emerged as one of the dominant processes that determines the large-scale morphology of mountain belts. The transport of mass at the Earth's surface affects the state of stress within the Earth's interior and, consequently, its response to tectonic processes. That erosion, and thus climate, are important players in determining the dynamical behaviour of the solid Earth has been demonstrated by sophisticated numerical models of the coupled lithosphere-hydrosphere system. The coupling is, however, difficult to represent by traditional numerical methods as it requires the accurate tracking of the free surface through deformation events that can easily lead to strain accumulation of several thousands of percent and the relative motion of parts of the model by thousands of kilometers. We first describe here various methods that have been developed in recent years to address these difficulties. We subsequently present a newly developed numerical model designed to address these challenges in a three-dimensional framework. To overcome the geometrical complexity of the problem, we have used an octree division of space in which arbitrary surfaces are embedded. These surfaces have a dual representation based on a dynamically evolving cloud of Lagrangian particles residing on the surface and a level set function (lsf) defined at the nodes of the octree. This dual approach allows us to combine accuracy and efficiency. The method has been tested against other numerical methods and analogue experiments.

Keywords: Challenges in computational simulation of natural processes; Hydrosphere-Lithosphere-Mantle system

Introduction: existing methods

Modelling the deformation of the solid Earth over geological time scales is a challenging problem that requires accurate methods to compute the velocity/deformation field within the Earth's interior, but also the geometry of interfaces, such as the free surface or the crust-mantle boundary (the so-called Moho discontinuity), that are advected by the deformation. Small variations in the geometry of the free surface lead to important perturbations of the stress field and impact on the nature of the flow in the underlying crust. For example, it is because the pressure gradients caused by topographic slope are capable of driving lower crustal flow that topographic slope never reaches values greater than a few degrees (when measured at the scale of an orogenic system) (Willett et al., 1993). Conversely, because it is characterized by a smaller density contrast than the free surface, the Moho discontinuity can be greatly deformed by tectonic processes. For the numerical modeller, the challenge is thus to accurately predict the geometry of these deforming surfaces, as well as the effect that their complex geometries can have on the flow. Most methods developed so far have been limited to two dimensions (Fullsack, 1995; Braun and Sambridge, 1994), mostly due to the complexity of the problem and computational cost. Lithospheric flow is also characterized by very large displacements and deformation, in part due to the complex, non-linear and localizing nature of rock rheology, which justifies the use of an Eulerian approach in which the numerical mesh is fixed in space. A Lagrangian approach (in which the numerical mesh is advected with the flow) is, however, much better suited to the tracking of a deforming surface. This is why mixed or Arbitrary-Lagrangian-Eulerian methods have been successfully used in the past (Fullsack, 1995), but they were limited in their resolution due to the regular nature of the Eulerian meshes used.

New 3D method

In the past two years, members of several research teams have collaborated to develop a numerical method that builds on previous experience in methodological development, with the aim of achieving sufficient accuracy and efficiency to permit three-dimensional analysis. The new method is based on an octree division of space, the tracking of deforming interfaces by using Lagrangian particles and level set functions, and a modified finite element approach based on an octree division of the element to perform volume integrals.

Figure 1: Non-uniform octree division of the unit cube; the leaves of the octree are drawn in a colour proportional to their size.

Octree division of space

We use an octree division of space to discretize a unit cube. The final discretization is made of cubes of non-uniform size and distribution (the 'leaves' of the octree) but allows for efficient 'navigation' through the leaves. For example, it is trivial to find the index of the leaf containing an arbitrary point. It is thus computationally efficient to interpolate a field onto the nodes of an octree. In Figure 1, we show a very simple octree that has been locally refined to level 10 (the size of the sides of the smallest leaves of the octree is 2^(-10)). We solve the three-dimensional form of the Stokes equation using the finite element method, in which the leaves of the octree are the elements. Where two adjacent leaves of the octree are of different level, hanging nodes appear; these are nodes that are not connected to all neighbouring elements. To overcome this limitation of the octree discretization, we resolve the geometric mismatch by imposing simple linear constraints forcing the flow to vary linearly across the faces where the mismatch occurs.
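The 'trivial' leaf lookup mentioned above can be sketched as a short descent through the tree: halve the cell at each level and pick the octant that contains the point. The node layout below (nested dictionaries with an eight-element children list) is an assumption made purely for illustration, not the authors' data structure.

```python
def locate_leaf(octree, point):
    """Descend an octree over the unit cube to the leaf containing `point`.
    Each node is either {'children': [8 nodes]} or a leaf {'data': ...};
    children are ordered by the 3-bit code (x_high | y_high<<1 | z_high<<2).
    Returns the leaf node and the path of child indices taken."""
    lo = [0.0, 0.0, 0.0]
    size = 1.0
    node, path = octree, []
    while 'children' in node:
        size *= 0.5
        code = 0
        for axis in range(3):
            if point[axis] >= lo[axis] + size:   # upper half along this axis?
                code |= 1 << axis
                lo[axis] += size
        node = node['children'][code]
        path.append(code)
    return node, path

# usage on a tiny hand-built octree: the root is split once, and its first
# child (the octant nearest the origin) is split again
leaf = {'data': None}
child0 = {'children': [dict(leaf) for _ in range(8)]}
root = {'children': [child0] + [dict(leaf) for _ in range(7)]}
node, path = locate_leaf(root, (0.1, 0.3, 0.2))
print(path)   # [0, 2]: child 0 of the root, then its sub-child with y in the upper half
```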

Surface Tracking

Surfaces are discretized by a series of particles that are strategically positioned on each surface and advected with the flow. The local density of the particles is maintained by injection and removal of particles according to a set of criteria based on inter-particle distances and the curvature of the surface (measured by the divergence of the normals at the particle locations). To each surface is also associated an octree, the resolution of which is a function of the distance to the surface (smallest leaves in the vicinity of the surface). A level set function (lsf) is also computed on the nodes of the octree. Its value is defined as the 'signed distance' to the surface. A collection of octrees and lsf's are thus constructed, one for each of the surfaces to be tracked. A master octree is constructed as the union of each of the surface octrees, and a complete set of lsf's is computed at each of its nodes.

Finite element representation

This master octree is used to construct the finite element matrices. The lsf's are used (locally) to determine the position of the nodes of each leaf (element) with respect to all of the surfaces. Elements can be of two types: those that are entirely within one 'medium', defined as the material comprised between two successive surfaces, and those that are cut by at least one surface. The contributions of those 'cut elements' are estimated by computing volumetric integrals of the finite element equations using an octree division of the element. As shown in Figure 2, the cut elements are divided into smaller cubes that are sequentially tested for intersection by the surfaces (using interpolated values of the various lsf's). This method, which we termed divFEM, ensures that we can estimate efficiently and accurately volume integrals of a function that may vary abruptly within an element, such as those generated at the free surface where the material properties vary greatly.

Figure 2: Division of the element/leaf illustrated by a simplified two-dimensional diagram. By successive octree division of the leaf, the algorithm identifies sub-cubes that are entirely comprised in one of the two media on either side of the surface cutting the element. A simplified analytical expression is used to estimate the relative volume of the small cubes still cut by the interface.

Figure 3: a) Evolution of the free surface of the fluid following a large amplitude, sinusoidal perturbation, as computed by our method (panel labelled "Triangular + Conformal FE"; axes: height versus x-distance). b) Comparison at t = 20 with a high resolution, 2D Lagrangian finite element solution (curves labelled "Triangle (Step 19)" and "Gcube (Step 11)").
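The idea behind the cut-element integration can be illustrated with a toy recursion: estimate the volume fraction of a cube lying on one side of an interface, given only a level set function, by subdividing until sub-cubes are uncut or a maximum depth is reached. This is a simplified stand-in for the divFEM quadrature described above, not the authors' code (in particular, the corner-sign test and the finest-level estimate here are cruder than the analytical expression they use).

```python
import numpy as np

def volume_fraction(lsf, lo, size, depth, max_depth=6):
    """Estimate the fraction of the cube [lo, lo+size]^3 where lsf < 0
    (i.e. on one side of the interface) by recursive octree subdivision.
    `lsf` is a signed-distance-like function of a 3-vector."""
    corners = np.array([[lo[0] + i * size, lo[1] + j * size, lo[2] + k * size]
                        for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    vals = np.array([lsf(c) for c in corners])
    if np.all(vals < 0):          # sub-cube entirely inside the medium
        return 1.0
    if np.all(vals > 0):          # entirely outside
        return 0.0
    if depth >= max_depth:        # cut cube at the finest level: crude estimate
        return float(np.mean(vals < 0))
    # otherwise subdivide into 8 children and average their fractions
    half = size / 2.0
    frac = 0.0
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                child_lo = (lo[0] + i * half, lo[1] + j * half, lo[2] + k * half)
                frac += volume_fraction(lsf, child_lo, half, depth + 1, max_depth)
    return frac / 8.0

# usage: a planar interface z = 0.3 cutting the unit cube; the exact answer is 0.3
plane = lambda p: p[2] - 0.3
print(volume_fraction(plane, (0.0, 0.0, 0.0), 1.0, depth=0))
```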

Comparison to other numerical methods

To verify the accuracy of our method, we computed the evolution of the free surface of a highly viscous fluid that was initially set to be a periodic sine function of amplitude comparable to its wavelength (Figure 3a). Gravitational forces drive the surface to a flat geometry, but following a scenario that leads to the formation of a cusp on the initially low side of the surface. We also computed the solution of this inherently two-dimensional problem with a Lagrangian finite element method. The results are shown in Figure 3b and demonstrate that the octree/lsf/divFEM based method is very accurate. We also performed a series of analogue experiments designed to test the accuracy of our method in tracking the deformation of the free surface and its effects on the underlying flow.

References

J. Braun and M. Sambridge. Dynamical Lagrangian Remeshing (DLR): A new algorithm for solving large strain deformation problems and its application to fault-propagation folding. Earth Planet. Sci. Lett., 124:211-220, 1994.

P. Fullsack. An arbitrary Lagrangian-Eulerian formulation for creeping flows and its application in tectonic models. Geophys. J. Int., 120:1-23, 1995.

S.D. Willett, C. Beaumont, and P. Fullsack. Mechanical model for the tectonics of doubly-vergent compressional orogens. Geology, 21:371-374, 1993.


Simulating the effect of Tsunamis within the Built Environment

S. G. Roberts(1), O. M. Nielsen(2)

(1) Department of Mathematics, Australian National University, Canberra, ACT, 0200, Australia

(2) Risk Assessment Methods Project, Geospatial and Earth Monitoring Division, Geoscience Australia,

Symonston, ACT, 2609, Australia

Abstract

Impacts to the built environment from a hazard such as tsunami are critical in understanding the economic and social effects of such events on our communities. In order to better understand these effects, Geoscience Australia and the Australian National University are developing a software modelling tool for the simulation of inundation of coastal areas by tsunamis. The tool is based on solving the Shallow Water Wave equation using a finite-volume method based on unstructured triangular grids, with fluxes calculated using a central scheme. An important capability of the method is its ability to model the process of wetting and drying as water enters and leaves an area. This means that it is suitable for simulating water flow onto a beach or dry land and around structures such as buildings. It is also capable of resolving hydraulic jumps, due to the ability of the central scheme to handle discontinuities. This talk will describe the mathematical and numerical models used, the architecture of the tool, and the results of a series of validation studies, in particular the comparison with experiment of a tsunami run-up onto a complex three-dimensional beach.

Keywords: Tsunami; Shallow Water Wave equation; Finite Volume Method
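A toy 1-D analogue gives the flavour of the finite-volume approach: conserved variables (h, hu), a central-type (Rusanov) interface flux, and an explicit update. It has a flat bed, no wetting and drying, and crude boundary treatment, so it only illustrates the structure of such a scheme; it is not the Geoscience Australia/ANU tool described in the abstract.

```python
import numpy as np

g = 9.81

def flux(h, hu):
    """Physical flux of the 1-D shallow water equations for state (h, hu)."""
    u = np.where(h > 0, hu / np.maximum(h, 1e-12), 0.0)
    return np.array([hu, hu * u + 0.5 * g * h ** 2])

def step(h, hu, dx, dt):
    """One finite-volume update with a Rusanov (central-type) interface flux."""
    u = np.where(h > 0, hu / np.maximum(h, 1e-12), 0.0)
    c = np.abs(u) + np.sqrt(g * h)                  # local wave speed estimate
    hL, hR = h[:-1], h[1:]                          # first-order interface states
    huL, huR = hu[:-1], hu[1:]
    a = np.maximum(c[:-1], c[1:])
    FL, FR = flux(hL, huL), flux(hR, huR)
    F = 0.5 * (FL + FR) - 0.5 * a * np.array([hR - hL, huR - huL])
    # conservative update of interior cells (end cells left untouched)
    h_new, hu_new = h.copy(), hu.copy()
    h_new[1:-1] -= dt / dx * (F[0, 1:] - F[0, :-1])
    hu_new[1:-1] -= dt / dx * (F[1, 1:] - F[1, :-1])
    return h_new, hu_new

# usage: a dam-break style initial condition on [0, 1]
n, dx = 400, 1.0 / 400
x = (np.arange(n) + 0.5) * dx
h = np.where(x < 0.5, 1.0, 0.5)
hu = np.zeros(n)
for _ in range(200):
    dt = 0.4 * dx / np.max(np.abs(hu / np.maximum(h, 1e-12)) + np.sqrt(g * h))
    h, hu = step(h, hu, dx, dt)
print(h.min(), h.max())
```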


Simulation of Lava Dome Growth Considering shear thinning, thermal feedback and strain softening

Alina Hale, Hans Mühlhaus, Laurent Bourgouin

Earth Systems Science Computational Centre (ESSCC), Australian Computational Earth Systems Simulator (ACcESS), Level 8, Sir James Foots Building (47a), Corner College and Staff House Roads, The University of Queensland, PO Box 6067, St Lucia, QLD 4072. Tel: +61 7 3346 4110, Fax: +61 7 3346 4134, e-mail: [email protected]

Abstract

For a greater understanding of the flow properties of highly viscous, crystal-rich magma during ascent and in Peléean lava dome formation, Finite Element Method (FEM) models have been developed. These models consider the fundamental controls on the eruption dynamics and the different growth styles (endogenous and exogenous). In endogenous dome growth the interior is a thermo-mechanically continuous structure, whilst in exogenous dome growth lava is extruded directly to the free surface due to the influence of faults. Transitions between these two growth regimes are observed for many lava domes and often denote a significant change in the growth dynamics and a propensity for the dome to collapse. The dome growth regime is governed by the rheology of the lava and the flow rate from the feeding conduit. At the lowest extrusion rates the extruded lava is highly crystalline and dome growth is predominantly exogenous, probably via the channeling of lava along structural discontinuities within the dome. This process is not understood quantitatively, but it is thought to be due to shear planes, formed following brittle failure, originating at the conduit edge where the shear stresses experienced between new lava entering and existing lava are greatest. The development of these structural discontinuities ultimately governs the growth style and may also be responsible for shallow earthquake activity. An axi-symmetric FEM model has been developed for generic dome growth based on the parallelized finite element based PDE solver eScript/Finley (Davies, Gross and Mühlhaus, 2004). The lava viscosity is known to depend upon temperature, pressure, crystal content and water content, and this is modelled using empirical data specific to the lava extruded from the Soufrière Hills Volcano. In our simulations we investigate the influence of thermal feedback due to shear (viscous) heating within the conduit and dome and its subsequent influence upon the flow profile. The models also consider the influence of the strain rate using a power-law viscosity (shear thinning). Our model equations are formulated in an Eulerian framework and the evolution of the free surface of the lava dome is modelled using a level-set method (Tornberg and Engquist, 2000). We demonstrate that the formation of internal shear bands can be triggered by the inclusion of rate-independent plastic deformations and strain softening.
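A small sketch of the general kind of rheology described, combining a temperature-dependent exponential with power-law shear thinning, is given below. The functional form and every parameter value are placeholders for illustration only; they are not the empirical Soufrière Hills fit used by the authors.

```python
import numpy as np

def viscosity(T, strain_rate, eta0=1e9, T_ref=1123.0, b=0.03, n=3.0,
              ref_rate=1e-5, eta_min=1e5, eta_max=1e13):
    """Effective viscosity [Pa s] combining a temperature-dependent exponential
    with power-law shear thinning:
        eta = eta0 * exp(-b (T - T_ref)) * (strain_rate / ref_rate)^((1-n)/n)
    All parameter values here are placeholders, not an empirical fit."""
    rate = np.maximum(strain_rate, 1e-15)          # avoid division by zero
    eta = eta0 * np.exp(-b * (T - T_ref)) * (rate / ref_rate) ** ((1.0 - n) / n)
    return np.clip(eta, eta_min, eta_max)

# shear (viscous) heating feeds back on viscosity: a hotter, faster-shearing
# conduit margin becomes weaker, localising the flow further
print(viscosity(T=1123.0, strain_rate=1e-5))   # reference point: returns eta0
print(viscosity(T=1173.0, strain_rate=1e-3))   # hotter and faster-shearing: much weaker
```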

References

1. Davies, M., Gross, L., Mühlhaus, H.-B., 2004. Scripting High Performance Earth Systems Simulations on the SGI Altix 3700. Proc. 7th Intl Conf. on High Performance Computing and Grid in Asia Pacific Region, 244-251.
2. Tornberg, A.-K. and Engquist, B., 2000. A finite element based level-set method for multiphase flow applications. Comput. Visual Sci. 3, 93-101.


Detection and Characterisation of Seafloor Evolution from Sonar Sensor Data

D.H. Smith

CSIRO Marine and Atmospheric Research, 233 Middle St. Cleveland QLD Australia 4163

Abstract

As part of the Great Barrier Reef Seabed Biodiversity Mapping Project, large quantities of acoustic data derived from a single beam sonar sensor have been generated during several research vessel cruises, limited subsets of which are accompanied by underwater video imaging. Such data constitute a plethora of seafloor signatures containing information on a range of properties relevant to benthic habitat studies, such as depth, vegetation, sediment, hardness and roughness. Extraction of these properties and their evolution behaviour from the available data is a key component of this study, and several tools are being applied, including discrete wavelet/packet transforms and the singular value decomposition. This presentation will outline and demonstrate the application of these techniques to selected data portions, both large and small, indicating basic insights gained with a particular focus on evolution behaviour. Future classification goals will also be discussed, with an emphasis on feature extraction and dimension reduction via an appropriate choice of basis in which the data will be represented.
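To indicate how wavelet transforms and the singular value decomposition can be combined for feature extraction and dimension reduction of this kind of data, here is a small Python sketch using PyWavelets and numpy on synthetic "pings". The wavelet choice, feature definition and data are arbitrary; this is not the author's processing chain.

```python
import numpy as np
import pywt   # PyWavelets, assumed available

def wavelet_features(ping, wavelet="db4", level=5):
    """Summarise one acoustic return (a 1-D echo trace) by the log-energy of
    each wavelet sub-band, a simple feature vector for later classification."""
    coeffs = pywt.wavedec(ping, wavelet, level=level)
    return np.array([np.log10(np.sum(c ** 2) + 1e-12) for c in coeffs])

# toy "survey": 500 synthetic pings of 512 samples each
rng = np.random.default_rng(0)
pings = rng.standard_normal((500, 512)) * np.linspace(1.0, 0.2, 512)
X = np.vstack([wavelet_features(p) for p in pings])

# dimension reduction via the singular value decomposition: keep the leading
# components as a compact basis in which to represent (and later classify) pings
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T          # each ping reduced to 2 coordinates
print(scores.shape, s[:3])
```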


Understanding basin evolution using global data sets

Christian Heine, R. Dietmar Müller(1)

(1) School of Geosciences and University of Sydney Institute of Marine Science (USIMS), Edgeworth David Building F05, Eastern Ave., Main Campus, The University of Sydney, NSW 2006, Australia

Abstract

The formation and evolution of broad intraplate sedimentary basins is usually attributed to failed, unsuccessful rifting followed by thermal cooling and subsidence of the lithosphere. However, the tectonic subsidence history of basins such as the West Siberian Basin, the Central European Basin or the Australian Canning and Eromanga Basins deviates from that expected for a simple failed rift basin. Intraplate basins often form on a very heterogeneous basement, sometimes referred to as "accretionary crust". In this context we define accretionary crust as crust having formed in Phanerozoic times by a series of continent-continent, arc-continent or terrane-continent collisions, incorporating major geological boundaries or sutures. This means the basins form on relatively new, young continental lithosphere which, due to its incorporated inhomogeneities, is rheologically weaker than relatively old and stable continental lithosphere (e.g. shields and cratons). It appears that the simple post-rift thermal subsidence model is not applicable to those basins on accretionary crust, and that we have to consider a broader variety of parameters when modelling the basin history. Not only does the architecture of the basin-underlying substrate have to be accounted for, but also a range of geodynamic processes, like the position of these basins relative to mantle upwellings/downwellings or active plate boundaries, the response to changes in relative plate motions, and igneous processes. Much of this information is already available, but the community is lacking tools and workflow implementations to explore the large parameter space, extract the necessary data for a given scenario, and process the large amount of geological and geophysical information and meta-data in such a way that input for numerical models can easily be generated. We investigate the formation and evolution of basin regions as described above by analysing their crustal structure, geology and plate tectonic history using a combination of an open-source geospatial database (PostgreSQL with PostGIS), freely available data and plate tectonic reconstruction tools, glued together by Python scripts and XML. The purpose of this analysis is to generate and store a large amount of observational data using an automated workflow and to derive a set of generalised parameters (e.g. thickness and 2D/3D geometry of various crustal layers, Moho temperature, crustal extension factors, tectonic subsidence) that will be used as input for geodynamic basin modelling. Using this workflow we create a library of classes of basin formation scenarios and corresponding numerical model outputs. This method facilitates a better understanding of the complex geological evolution of intraplate basins.

Keywords: intraplate sedimentary basins; plate tectonics; global data analysis; geospatial databases; workflow
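A sketch of one step of such an automated extraction workflow is shown below: a parameterised PostGIS query pulling point observations that fall within a named basin outline, via psycopg2. The database, table and column names are hypothetical; only the PostGIS functions (ST_Intersects, ST_X, ST_Y) are standard.

```python
import psycopg2   # PostgreSQL driver; database, tables and columns below are hypothetical

def crustal_profile_for_basin(basin_name):
    """Pull crustal-thickness observations falling inside a named basin outline,
    as one step of an automated workflow feeding a geodynamic basin model."""
    conn = psycopg2.connect("dbname=geodyn user=modeller")
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT p.id, ST_X(p.geom), ST_Y(p.geom), p.moho_depth_km
                FROM crustal_points AS p
                JOIN basin_outlines AS b ON ST_Intersects(p.geom, b.geom)
                WHERE b.name = %s
                ORDER BY p.id;
                """,
                (basin_name,),
            )
            return cur.fetchall()
    finally:
        conn.close()

# usage (hypothetical basin name):
# rows = crustal_profile_for_basin("West Siberian Basin")
```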


Application of interactive geological inversion techniques and grid computing to numerical modelling of geological processes

W. Potma(1&2), T. Poulet(1&2)

(1) Computational Geoscience for Predictive Discovery, CSIRO Exploration & Mining, ARRC Kensington,

Western Australia 6151. (2) Predictive Mineral Discovery CRC

Abstract

Numerical modelling of geological processes requires the quantification of a large number of (often unmeasurable) parameters in order to reproduce an observed physical behavior. Interactive geological inversion using genetic algorithms, grid computing and complex post processing can be combined with numerical modelling to reverse engineer the key parameters which control the physical processes observed in the geological record. A series of customised tools and a workflow have been developed to perform this analysis and provide a quantified basis for predictive numerical modelling of geological processes to aid exploration geologists.

Keywords: Numerical Modelling; Interactive Geological Inversion; Genetic Algorithm; Grid Computing; Data Analysis.

Introduction

4D numerical modelling of geological processes requires the quantification of a large number of parameters in order to generate a valid result, or reproduce an observed physical behavior. Many of these variables cannot be directly measured, because they represent rock properties or speculated geometries at >10 km depth, and must therefore either be inferred or tested as part of the numerical modelling process. Any given modelling scenario may require the testing of multiple geometries (fault orientations, rock layering etc.), physical rock properties (cohesion, tensile strength, friction angle, permeability etc.) and boundary conditions (pore pressures, deformation forces, fluid fluxes etc.). The potential parameter search space is vast, and even when individual variables are assigned only 8 possible values, parameter spaces as large as 15^8 may need to be searched in order to reproduce an observed mechanical deformation and fluid flow behavior in rock. We employ interactive geological inversion techniques (Boschetti & Moresi 2001), grid computing and semi-automated data retrieval and post-processing to enable rapid visual ranking and analysis of results in 2D, 3D and 4D. The tools and workflow we have developed significantly reduce the time required to explore the vast parameter space associated with reverse engineering of complex geological processes (such as vein and shear zone formation in gold ore systems).

Interactive Geological Inversion

Interactive Geological Inversion techniques (Boschetti & Moresi 2001) are used to search, in an efficient way, the parameter space of the geological process being simulated. We generally use Genetic Algorithms (Goldberg 1989) to drive the search; however, some other algorithms are being investigated, such as Lipschitzian methods (Strongin & Sergeyev 2000), to try to limit the randomness of the process. Additional functions are included to optimise the search, such as reducing the parameter space with cutting equations. The inversion application generates different sets of parameters to test, and expects the user to provide feedback about the quality of the model result for each set of parameters.
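
As a concrete illustration of this interactive loop, the following sketch shows a minimal steady-state genetic algorithm in Python in which the fitness of each candidate parameter set is typed in by the user after inspecting the corresponding model run, rather than computed automatically. The parameter names, ranges and population settings are hypothetical; the actual inversion application is considerably more sophisticated.

```python
import random

# Hypothetical parameter ranges (name: (min, max)); the real inversion tool
# reads these from the model configuration.
PARAM_RANGES = {
    "log10_permeability_m2": (-18.0, -14.0),
    "cohesion_mpa": (1.0, 50.0),
    "friction_angle_deg": (10.0, 40.0),
}

def random_individual():
    """Draw one candidate parameter set uniformly from the ranges."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def crossover(a, b):
    """Arithmetic crossover between two parent parameter sets."""
    w = random.random()
    return {k: w * a[k] + (1.0 - w) * b[k] for k in a}

def mutate(ind, rate=0.2):
    """Re-draw each parameter with probability `rate`."""
    for k, (lo, hi) in PARAM_RANGES.items():
        if random.random() < rate:
            ind[k] = random.uniform(lo, hi)
    return ind

def user_rank(ind):
    """Interactive fitness: the geoscientist inspects the simulation driven
    by `ind` and types a score (higher is better)."""
    print("Parameters:", ind)
    return float(input("Score for this model result (0-10): "))

def interactive_ga(pop_size=8, n_children=20):
    """Steady-state GA: each new child replaces the worst-ranked model."""
    population = []
    for _ in range(pop_size):
        ind = random_individual()
        population.append((ind, user_rank(ind)))
    for _ in range(n_children):
        population.sort(key=lambda pair: pair[1], reverse=True)
        parents = [ind for ind, _ in population[: pop_size // 2]]
        child = mutate(crossover(*random.sample(parents, 2)))
        population[-1] = (child, user_rank(child))
    return max(population, key=lambda pair: pair[1])

if __name__ == "__main__":
    best, score = interactive_ga()
    print("Best-ranked parameter set:", best, "with score", score)
```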

Grid Computing

Each simulation can take around 15 hours to run, and it is therefore essential to be able to run many different simulations in parallel. Grid computing is an elegant way to do so, as it allows a user to benefit from the computing power of many distributed machines without having to manage each machine or job individually.
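
The project's grid middleware is not described here; as a rough local analogue of farming simulations out in parallel, the sketch below distributes a batch of hypothetical model runs over worker processes using Python's standard library. On the real grid each run_simulation call would instead become a job handed to a remote compute resource.

```python
from concurrent.futures import ProcessPoolExecutor

def run_simulation(params):
    """Hypothetical stand-in for a long-running forward model; on the real
    grid this would be a job submitted to a remote compute resource."""
    # ... launch solver, wait for completion, collect outputs ...
    return {"params": params, "misfit": sum(params.values())}

def run_batch(parameter_sets, max_workers=4):
    """Fan a batch of independent candidate models out over worker processes,
    mirroring the way candidate models are farmed out across the grid."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_simulation, parameter_sets))

if __name__ == "__main__":
    batch = [{"cohesion_mpa": c, "friction_angle_deg": f}
             for c in (5.0, 10.0) for f in (20.0, 30.0)]
    for result in run_batch(batch):
        print(result)
```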

Page 21: Environmental Applications of Data Miningrses.anu.edu.au/cadi/Whiteconference/abstracts/abstracts_booklet.pdf · Environmental Applications of Data Mining Saˇso Dzˇeroski Department

Visualisation

We also use Sammon mapping (Sammon 1969) to visualise the relative similarity of the “N-dimensional” parameter space in a 2D representation, to help identify the locus (or loci) of any emergent global minima (good model results). The aim of this process is to converge on a group (or groups) of model results which most closely represent the observed geological phenomena. We are then able to quantify the various parameter values (rock properties, boundary conditions etc.) which contribute to or control the observed mechanical and fluid flow behavior. These parameter values can then be applied to modelling what-if geological scenarios, which can be used as a predictive tool to aid exploration geologists, rock mechanics engineers and geological process modellers.
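
Sammon's mapping itself is compact enough to sketch. The following is a minimal gradient-descent implementation of the Sammon stress with NumPy, applied to random stand-ins for the N-dimensional model-parameter vectors; the production visualisation tooling is not shown, and the learning rate and iteration count are illustrative only.

```python
import numpy as np

def sammon(X, n_iter=500, lr=0.3, eps=1e-9):
    """Project the rows of X into 2-D by gradient descent on Sammon's stress
    (Sammon, 1969): E = (1/c) * sum_{i<j} (D_ij - d_ij)^2 / D_ij,
    where D are the original distances, d the 2-D distances and c = sum D_ij."""
    n = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)) + eps
    c = D[np.triu_indices(n, 1)].sum()
    Y = np.random.default_rng(0).normal(size=(n, 2)) * 1e-2
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(-1)) + eps
        # Gradient of the stress with respect to each 2-D point.
        w = (D - d) / (D * d)
        np.fill_diagonal(w, 0.0)
        grad = (-2.0 / c) * (w[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y

if __name__ == "__main__":
    # Hypothetical example: 40 model-parameter vectors in a 10-D space.
    X = np.random.default_rng(1).random((40, 10))
    print(sammon(X)[:5])
```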

The Next Problem

We are now able to run more models than we can reasonably analyse, interpret and rank, resulting in a severe data analysis and feedback bottleneck which must be overcome. At this stage we are exploring two options to overcome this bottleneck:

1) The implementation of automated 4D (x, y, z + time) image recognition applications. This method must be capable of ranking 4D raster and vector data based on a set of user-defined rules. Effective ranking relies on assessing the interplay between the spatial distribution and absolute values of up to 10 data output types across multiple 3D visualisations (in the same physical space but potentially different time). This produces a composite ranking result which is only as good as its defining rules and the resolution/perspective of the images being assessed. This method would potentially enable the “bad” model results to be identified and binned, leaving the user to concentrate on only the good results, which require more complex manual analysis and ranking.

2) A non-visual primary data analysis method which analyses the primary numerical outputs from the models. These data are attached to the nodes and centroids of a 3D finite difference mesh, and are used to identify which models exhibit the required numerical characteristics in the correct spatial (and temporal) location within the mesh.

Both methods require significant user input to establish the “rules” which constitute a good versus a bad result; however, this process is already undertaken by the user as part of the manual ranking process. The challenge is to translate the data analysis process, which is currently a completely manual task, into a partly (or potentially completely) automated process. The difficulty of this task is compounded by the need for both the user and the data analysis tool to “learn” and adapt the interpretation/ranking rules (and potentially the method) as unexpected modelling results are received.

References

F. Boschetti and L. Moresi. Interactive inversion in geosciences. Geophysics, 66:1226-1234, 2001.

D.E. Goldberg. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Publ. Co., Inc., 1989.

J. W. Sammon, Jr. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5):401-409, May 1969.

R.G. Strongin and Y.D. Sergeyev. Global Optimization with Non-Convex Constraints: Sequential and Parallel Algorithms. Kluwer Academic Publishers, ISBN 0-7923-6490-2, October 2000.


A New Methodology for Addressing Nonlinear Inverse Problems and its Application to Characterise a Real Petroleum Reservoir

P.J. Ballester(1), J.N. Carter(2)

(1) Physical and Theoretical Chemistry Laboratory, Oxford University, South Parks Rd, OX1 3QZ, UK (2) Dept. of Earth Science and Engineering, Imperial College London, Prince Consort Rd, SW7 2AZ, UK

Abstract

In Petroleum Engineering, reservoir management aims to maximise the profit from a hydrocarbon reservoir. Highly nonlinear numerical models, which describe the internal structure of hydrocarbon reservoirs, are used to make reservoir management decisions. These models are aimed at providing accurate predictions of the reservoir behaviour under different scenarios. This requires that the model parameters are calibrated so that the model reproduces all the available data. Given the impossibility of directly measuring these parameters in the field, one has to infer them from indirect measurements, such as the oil production rate at a given reservoir well. This is an example of a nonlinear inverse problem. This talk will describe a recently developed methodology for addressing nonlinear inverse problems and its application to the characterisation of a real petroleum reservoir. This methodology is based on a real-coded Genetic Algorithm which has been modified to run on a cluster of computers.

Keywords: Nonlinear Inverse Problem, Genetic Algorithm, Clustering, Parameter Estimation, History Matching, Petroleum Reservoir Characterisation

Introduction

Reservoir Characterisation can be defined as the process of identifying a numerical model, the behaviour of which must be as similar as possible to that of the hydrocarbon reservoir under study. A key stage of this process is to condition the adopted reservoir model to dynamic data from measurements in wells (the historical production data). This nonlinear parameter estimation problem, known as History Matching, usually has multiple distinct solutions (i.e. calibrated or history matched models). These solutions will manifest themselves as distinct optima for some objective function and will be separated by regions of poor objective function value. In History Matching, the challenge is to identify all the high quality optima and sample the parameter space around them. This sampling gives rise to an ensemble of history matched models, which can thereafter be used to quantify uncertainty on production forecasts. It is useful to study each type of history matched model separately, as this will let us understand the production mechanism that may have occurred. We may also be able to identify measurements that would allow us to discriminate between the different types of history matched models. However, the task of discovering these types or clusters of models from the ensemble is very hard, mainly due to the high number of model parameters involved.

Methods

In this study, a new real-coded Genetic Algorithm (GA) is applied to history match a real petroleum reservoir using its recorded production data and a numerical model of the reservoir. This GA is implemented within a non-generational, steady-state scheme. In order to shorten the computation time, the solutions (instances of the model) proposed by the GA are evaluated in parallel on a group of 24 computers. All of the solutions generated by this parallel GA are finally analysed using a clustering algorithm. This algorithm does not require the number of expected clusters to be chosen in advance, and it is able to handle a very high number of model parameters. This is done to find the number of distinct solutions within the ensemble generated by the GA.
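
The clustering algorithm itself is not specified in this abstract. As a rough illustration of clustering a high-dimensional ensemble without fixing the number of clusters in advance, the sketch below applies the density-based DBSCAN implementation from scikit-learn to a synthetic ensemble of 82-parameter models; the synthetic data and the eps and min_samples values are assumptions for demonstration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical ensemble: 300 history-matched models, each described by
# 82 parameters, drawn around 5 distinct "types" of solution.
rng = np.random.default_rng(0)
centres = rng.random((5, 82))
ensemble = np.vstack([c + 0.02 * rng.standard_normal((60, 82)) for c in centres])

# Scale the parameters so that no single parameter dominates the distance
# metric, then cluster; DBSCAN infers the cluster count from the data density.
X = StandardScaler().fit_transform(ensemble)
labels = DBSCAN(eps=5.0, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters found; {np.sum(labels == -1)} models unclustered")
```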

Discussion

The application of the methodology to this nonlinear inverse problem yields a large improvement with respect to past studies in that reservoir, both in terms of the quality and diversity of the obtained history matched models.


The best history matched models are shown in Fig.1. The methodology was able to identify 19 different types (clusters) of reservoir models compatible with the measured data. These results show that, despite the use of regularisation terms in the objective function, many distinct reservoir models may be obtained from reservoir characterisation studies. This suggests that it is more important to search for multiple solutions than is currently perceived by most of the Petroleum Engineering community.

Figure 1: Results from the inversion of a real petroleum reservoir, which was described by a numerical model with 82 parameters. Each column in the plot is a calibrated reservoir model and each row is a given model parameter. The cell colour indicates the scaled value of the associated model parameter. The clustering algorithm revealed 19 different types (clusters) of history matched reservoir models (note that models within a cluster are similar, whereas models from different clusters are dissimilar). This plot constitutes a graphical representation of the uncertainty in the reservoir characterisation.

Future Work

An interesting issue for further research is to find out how well these estimated models predict future data. This validation would compare a forecast envelope of the estimated models with the two years of additional data that have now been collected. This forecast envelope could be determined in several ways, for instance, by selecting the best model for each cluster and running them forward in time.

References

P. J. Ballester. New Computational Methods to Address Nonlinear Inverse Problems. PhD thesis, Dept. of Earth Science and Engineering, Imperial College London, University of London, UK. 2005.

P. J. Ballester and J.N. Carter. An Effective Real-Parameter Genetic Algorithm with Parent Centric Normal Crossover for Multimodal Optimization. Genetic and Evolutionary Computation Conference (GECCO-04, Seattle, USA). Lecture Notes in Computer Science, Springer, 3102:901-913, 2004.

M. Sambridge and K. Mosegaard. Monte Carlo methods in geophysical inverse problems. Reviews of Geophysics, (40) 3:1–29, 2002.


Teleseismic imaging in southeast Australia using data from multiple high density seismic arrays

N. Rawlinson, B. L. N. Kennett

Research School of Earth Sciences, Australian National University, Canberra ACT 0200

Abstract

The recent proliferation of passive seismic array experiments in southeast Australia has resulted in the dense coverage of Tasmania, Victoria and parts of South Australia and New South Wales by some 260 plus seismometers in less than a decade. In total, six separate temporary deployments have been carried out in order to build this large network of instrumentation. Teleseismic tomography, which exploits relative arrival time residuals from distant earthquakes, is used to image the structure of the lithosphere beneath each array. A new tomographic inversion method, which uses advanced wavefront tracking techniques to solve the forward problem of predicting arrival time residuals through a 3-D heterogeneous model, and an efficient subspace inversion method to iteratively solve the non-linear inverse problem, is applied to data collected from the recent TIGGER and SEAL experiments. Results from both these studies demonstrate the potential of this class of seismic imaging in revealing important structural information beneath regions obscured by young cover sequences. The challenge of combining all passive array datasets from southeast Australia in a single tomographic inversion will also be discussed.

Keywords: Teleseismic tomography, southeast Australia, seismic array

Introduction

The idea of using multiple deployments of seismic arrays to gradually span a large region with a useful density of seismometers was first put into practice with the SKIPPY experiment in the early 1990s. By exploiting surface waveforms from the combined arrays, it is possible to build detailed shear wavespeed images of the Australian continent (e.g. Fishwick et al., 2005). More recently, this idea has been adopted on a grand scale by the USArray, which is currently attempting to progressively cover continental USA with a uniform distribution of seismometers. In southeast Australia, the three year MALT experiment began in 1998 in western Victoria with the deployment of the LF98 array (Graeber et al., 2002), which comprises 40 short period recorders, for a period of approximately four months. This was followed in 1999 and 2000 by the MB99 and AF00 arrays respectively, which have resulted in a dense coverage of stations spanning a region between Melbourne and Adelaide (see Figure 1).

In 2002, the 72 broadband and short period recorders of the TIGGER array were deployed in northern Tasmania for a five month period by the Research School of Earth Sciences, Australian National University (Figure 1). This was followed by 20 short period recorders in southern New South Wales and northern Victoria in 2004-2005, also for a five month period, as part of SEAL. The most recent experiment involved the deployment of 50 short period recorders in western Victoria in September 2005 as part of the EVA experiment (Figure 1). These instruments will be in place until May 2006. As Figure 1 shows, the different arrays are all geographically linked (with the obvious exception of TIGGER), and together with a number of broadband stations from the QUOLL experiment, span a large region of southeast Australia. To date, data from each array has been analysed separately, but there would be definite benefits in simultaneously inverting all available data for a unified tomographic model.

Data reduction

The group of short period seismic arrays shown in Figure 1 record ground motion at a rate of between 20 - 25 samples per second using vertical component seismometers. Over a period of 4-5 months, large volumes of data are recorded. In the case of MALT, each of the three component arrays produced about 30 Gb of continuously recorded data. The total volume of data for all arrays shown in Figure 1 is in excess of 200 Gb.


[Figure 1 map omitted. The map covers southeast Australia (approximately 136°E-150°E, 32°S-44°S) and marks the Murray Basin, the Tamar Fracture System (TFS) and the array deployments listed in the key: LF98 (1998), MB99 (1999), AF00 (2000), QUOLL (1999), TIGGER (2002), SEAL (2004-2005), EVA (2005-2006).]

Figure 1: Location of the TIGGER and SEAL arrays within the framework of other passive array deployments in the region.

The process of distilling out the information actually used in the inversion process begins by extracting data windows (usually 30 - 60 minutes long) corresponding to expected onset times of large earthquakes, which can be obtained from global catalogues. The next step involves trying to identify the arrival of global seismic phases (e.g. direct transmissions, reflections from the outer core), and then estimating the onset time of the associated wavetrain at each recorder. Both these processes can be done quite efficiently in a semi-automated fashion, and result in 10s of Gb of data being reduced to 10s of Kb. A significant portion of ongoing research efforts in seismic imaging is directed towards trying to exploit more of the seismic record than simply the arrival time of specific phases.
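
A semi-automated window extraction of this kind can be sketched with the ObsPy toolkit (a modern convenience used here purely for illustration; it is not necessarily what the authors used). The station coordinates, file name, catalogue service, distance limits and window length below are assumptions.

```python
from obspy import read, UTCDateTime
from obspy.clients.fdsn import Client
from obspy.geodetics import locations2degrees
from obspy.taup import TauPyModel

# Hypothetical station coordinates and continuous data file for one recorder.
STA_LAT, STA_LON = -41.5, 146.5
st = read("tigger_station01.mseed")

# Large earthquakes from a global catalogue (here the IRIS FDSN service).
client = Client("IRIS")
catalogue = client.get_events(starttime=UTCDateTime("2002-06-01"),
                              endtime=UTCDateTime("2002-07-01"),
                              minmagnitude=6.0)

model = TauPyModel(model="iasp91")
windows = []
for event in catalogue:
    origin = event.preferred_origin() or event.origins[0]
    dist = locations2degrees(origin.latitude, origin.longitude, STA_LAT, STA_LON)
    if not 30.0 <= dist <= 95.0:          # keep teleseismic distances only
        continue
    arrivals = model.get_travel_times(source_depth_in_km=origin.depth / 1000.0,
                                      distance_in_degree=dist,
                                      phase_list=["P"])
    if not arrivals:
        continue
    p_onset = origin.time + arrivals[0].time
    # Cut a ~30 minute window spanning the predicted P onset.
    windows.append(st.slice(p_onset - 300, p_onset + 1500))

print(f"Extracted {len(windows)} candidate teleseismic windows")
```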

Teleseismic tomography

Teleseismic tomography uses relative arrival time residuals from distant earthquake sources, recorded across an array of seismic stations, to image the seismic structure of the crust and upper mantle (e.g. Aki et al., 1977; Humphreys and Clayton, 1990). Relative arrival time residuals can be extracted from seismic records using cross-correlation type methods (e.g. the adaptive stacking method of Rawlinson and Kennett, 2004) which exploit the relative invariance (or coherence) of the teleseismic coda across a dense local array. The new teleseismic tomography method we have developed to map the extracted arrival time residual patterns as 3-D perturbations in seismic wavespeed uses a non-linear tomographic procedure that combines computational speed and robustness. Structure beneath the array is represented using a mosaic of smoothly varying cubic B-spline volume elements, the values of which are controlled by a mesh of velocity nodes in spherical coordinates. A grid based eikonal solver, known as the Fast Marching Method (FMM), is used to compute traveltimes from the base of the model to the receiver array on the surface (Rawlinson and Sambridge, 2004). The inverse problem, which requires the velocity node values to be adjusted in order to satisfy the observed traveltime residual patterns, subject to damping and smoothing regularization, is solved using a subspace inversion method. The non-linear nature of the tomographic inverse problem is addressed by iterative application of the forward and inverse steps.
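
The scheme couples an FMM forward solver with a subspace inversion, neither of which is reproduced here. As a highly simplified stand-in for the iterate-forward-then-invert structure, the sketch below applies a damped Gauss-Newton update to a small synthetic non-linear forward problem; the forward operator, finite-difference Frechet derivatives and damping value are illustrative assumptions.

```python
import numpy as np

def forward(m, G0):
    """Hypothetical mildly non-linear forward operator standing in for the
    fast marching traveltime calculation: d = G0 m + small quadratic term."""
    return G0 @ m + 0.05 * (G0 @ m) ** 2

def jacobian(m, G0, h=1e-6):
    """Finite-difference derivatives (the real scheme derives sensitivities
    from the FMM traveltime fields)."""
    J = np.empty((G0.shape[0], m.size))
    f0 = forward(m, G0)
    for j in range(m.size):
        mp = m.copy()
        mp[j] += h
        J[:, j] = (forward(mp, G0) - f0) / h
    return J

def invert(d_obs, m0, G0, damping=0.1, n_iter=6):
    """Iterative damped least-squares update,
    m_{k+1} = m_k + (J^T J + damping*I)^{-1} J^T r,
    mimicking the iterate-forward-then-invert loop of the tomography."""
    m = m0.copy()
    for _ in range(n_iter):
        r = d_obs - forward(m, G0)
        J = jacobian(m, G0)
        m += np.linalg.solve(J.T @ J + damping * np.eye(m.size), J.T @ r)
        print("rms residual:", np.sqrt(np.mean(r ** 2)))
    return m

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G0 = rng.random((200, 30))               # 200 residuals, 30 velocity nodes
    m_true = rng.normal(scale=0.1, size=30)  # "true" wavespeed perturbations
    d_obs = forward(m_true, G0) + rng.normal(scale=0.01, size=200)
    m_est = invert(d_obs, np.zeros(30), G0)
    print("max parameter error:", np.abs(m_est - m_true).max())
```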


Results from TIGGER and SEAL

A total of 6,520 arrival time residuals are inverted from the TIGGER dataset to produce a 3-D wavespeed perturbation model; for SEAL, this value is 3,085. In both cases, six iterations of the tomographic inversion method are required to achieve convergence. Despite the large number of velocity nodes that are inverted for in both cases (61,380 for TIGGER; 21,645 for SEAL), the computing time on a 1.6 GHz Opteron PC is approximately 20 minutes to solve the complete problem in each case. The reason that the SEAL inversion is no faster than the TIGGER inversion is due to the greater number of teleseismic sources (about 50% more than TIGGER) that are used; in terms of computing time, FMM is insensitive to the number of receivers, but not to the number of sources. Figure 2a shows an E-W cross section through the TIGGER solution model with some of the principal features highlighted. Figure 2b shows a horizontal slice through the SEAL solution model with a schematic map of major regions and boundaries superimposed.

[Figure 2 panels omitted. Panel (a): E-W depth section (0-180 km) along 41.4°S through the TIGGER model, plotted against longitude (144°E-149°E) with a δvp colour scale of ±300 m/s; annotations mark elevated crustal velocity on either side, a zone of relatively low velocity, an easterly dipping structure, the TFS and regions of smearing. Panel (b): horizontal slice at 140 km depth through the SEAL model (approximately 138°E-150°E, 32°S-39°S) with a δvp colour scale of ±300 m/s (±3.72%); annotations mark the Gawler Craton, Delamerian Orogen, Tasman Line, the Stawell, Bendigo, Melbourne and Tabberabbera Zones, the Wagga-Omeo Complex, the Western, Central and Eastern Lachlan Orogen, and the Moyston, Heathcote, Governor and Gilmore Faults.]

Figure 2: (a) E-W slice through the TIGGER solution model; (b) horizontal slice through the SEAL solution model.

The location of southeast Australia relative to surrounding seismogenic zones means that path coverage is relatively dense from the north and east, but quite sparse from the south and west. Checkerboard resolution tests, which attempt to recover a synthetic 3-D pattern of alternating fast and slow anomalies using the same path coverage as the observations, show that this first order variation in path coverage has some effect on the recovery of structure. In particular, fine scale structure (approximately equal to the average station spacing) towards the southern and western ends of both arrays tends to be smeared out, although larger scale features appear to be quite well resolved.

Discussion and conclusions

Many fundamental questions regarding the structure and evolution of the lithosphere beneath southeast Australia remain unanswered, including the development of the Palaeozoic Lachlan Orogen, and the relationship between Tasmania and mainland Australia. The results from the TIGGER experiment reveal several interesting features, including a pronounced W-E increase in crustal velocity at about 147.4°E (Figure 2a). This abrupt change in wavespeed supports the idea that eastern Tasmania is underlain by dense rocks with an oceanic crustal affinity, while western Tasmania comprises continentally derived siliciclastic crust. Interestingly, the Tamar Fracture System (TFS) does not overlie the boundary between the two regions, which suggests that it is a shallow feature. The easterly dipping structure further west may represent remnants of an early phase of easterly subduction during the Tyennan Orogeny in the Cambrian.

The horizontal slice through the SEAL solution model (Figure 2b) shows a dominant W-E fast-slow-fast wavespeed variation that can be observed throughout the upper mantle between 70-250 km depth. The transition from faster to slower wavespeeds in the west is indicative of a change from Proterozoic to Phanerozoic lithosphere that has also been observed further south by the LF98 experiment (Graeber et al., 2002). The elevated velocities beneath the Stawell Zone may well be caused by the presence of Precambrian basement extending beneath the western part of the Lachlan Orogen in the vicinity of the Murray Basin. The relatively fast velocities observed beneath the Wagga-Omeo Complex (Figure 2b) point to a significant change in character of the upper mantle between the central and western subprovinces of the Lachlan Orogen, although the precise reason for this change is difficult to identify.

Although all the different seismic arrays on the Australian mainland (Figure 1) form a single large array with no major gaps, it is not a simple exercise to try and relate features imaged beneath one array with features imaged beneath an adjacent array. There are two reasons for this: (1) edge effects, which cause structure to be smeared out towards the edges of the array due to insufficient angular path coverage, and (2) the use of relative arrival time residuals in the inversion, which means that unless the average wavespeed with depth is identical beneath adjacent arrays, they will not join continuously. Given these limitations, it is far more desirable to simultaneously invert all available data for a unified model of the entire region. Although this will require a very large tomographic inversion problem to be solved, with of the order of 30,000 ray paths and 500,000 unknowns, the new tomographic scheme we have developed has the necessary computational efficiency and robustness.

References

K. Aki, A. Christoffersson, and E. S. Husebye. Determination of the three-dimensional seismic structure of the lithosphere. J. Geophys. Res., 82:277-296, 1977.

S. Fishwick, B. L. N. Kennett, and A. M. Reading. Contrasts in lithospheric structure within the Australian craton - insights from surface wave tomography. Earth Planet. Sci. Lett., 231:163-176, 2005.

F. M. Graeber, G. A. Houseman, and S. A. Greenhalgh. Regional teleseismic tomography of the western Lachlan Orogen and the Newer Volcanic Province, southeast Australia. Geophys. J. Int., 149:249-266, 2002.

E. D. Humphreys and R. W. Clayton. Tomographic image of the Southern California Mantle. J. Geophys. Res., 95:19,725–19,746, 1990.

N. Rawlinson and B. L. N. Kennett. Rapid estimation of relative and absolute delay times across a network by adaptive stacking. Geophys. J. Int., 157:332-340, 2004.

N. Rawlinson and M. Sambridge. Multiple reflection and transmission phases in complex layered media using a multistage fast marching method. Geophysics, 69:1338-1350, 2004.


Desperately Trying to Cope with the Data Explosion in Astronomical Sciences

Ray Norris

CSIRO Australia Telescope National Facility, PO Box 76, Epping, NSW 1710 ([email protected])

Abstract

Like Earth and environmental sciences, astronomy is in the middle of a data explosion, and it is valuable to compare the way this explosion is being managed in these different fields. Astronomy has a distinguished tradition of using technology to accelerate the quality and effectiveness of science, and data-intensive initiatives such as the Virtual Observatory are in the vanguard of the scientific data revolution. However, we face a number of challenges, such as:

• Our current freedom to create open-access databases on the web is threatened by those who would like all data to be subject to strict Intellectual Property controls.

• We have excellent data centres, which are widely used, and yet most data published in journals never appears in them.

• Some data obtained with publicly-funded observatories never enters the public domain.

• Major projects are started with insufficient thought given to how the data will be managed.

• Our colleagues in developing countries still fail to get the electronic access to data and journals that most of us take for granted.

• We don't have mechanisms in place for managing, curating, and when appropriate digitising, valuable legacy data from earlier observations.

• We don't work sufficiently closely with, and thereby learn from, our colleagues in other fields.

But perhaps the biggest challenge is that most astronomers are unaware that these challenges exist! In some cases, there is even resistance to addressing them (e.g. "Why should I share my data with my competitors?"). To try to build awareness within the astronomical community, we have drafted an "Astronomers Data Manifesto" and are using it to trigger debate. Other fields of science face similar problems, and the ICSU (International Council for Science) is trying to forge a path forward on behalf of all areas of science. And yet, most scientists are sadly unaware of these battles being fought on their behalf.


Towards Service-Oriented Geoscience: SEE Grid and APAC Grid

R. Woodcock(1), R. Fraser(2)

(1) CSIRO Exploration and Mining and pmd*CRC, 26 Dick Perry Ave, Kensington, Western Australia 6151, Australia (2) IVEC, 26 Dick Perry Ave, Kensington, Western Australia 6151, Australia

Abstract

Open geospatial standards, service-oriented architectures (SOA) and grid computing enable new approaches to publishing and accessing geoscience data and programs. The linkages between three collaborating projects show how the use of standards-based service interfaces and protocols is enhancing the capabilities of geoscientists. The first of these projects, “The Solid Earth and Environment Grid (SEE Grid) Roadshow 2005”, demonstrates how open geospatial standards can be used to provide interoperable access to Government geoscience data held at all Australian geological surveys. The “Australian Partnership for Advanced Computing (APAC) Grid Geosciences” project illustrates how computationally demanding geoscience programs can be made available as services and distributed across the APAC partners’ HPC and storage resources in a manner that requires limited knowledge of the physical infrastructure. Finally, the “predictive minerals discovery CRC (pmd*CRC) project” shows how these services can be chained to provide advanced modelling and interactive inversion of mineralization processes for the purposes of improved exploration targeting.

Keywords: Service-oriented architecture; grid computing; computational geoscience

Introduction

CSIRO Exploration and Mining and the predictive minerals discovery CRC (pmd*CRC) use numerical modelling of geological systems to systematically explore the process-related parameters governing the formation of mineral deposits (http://www.pmdcrc.com.au/). At a high level, the workflow used is common to many other research investigations:

1) gather input data including geological and geometric properties and create the model
2) perform the computation using a suitable program and computing architecture
3) analyse the results
4) repeat if further study is required

The types of investigations undertaken are highly variable, ranging from simplified geologic models to real-world scenarios. Input data includes geological observations of rock properties and chemical composition supplied by mining companies and geological surveys. Finite element meshes are created from 3D geological models often supplied by mining companies and can take 1 – 2 weeks to produce. Computation time may range from a couple of hours on a desktop PC to a couple of weeks on an HPC depending on which phenomena are simulated, the numerical solver, the computing architecture, and the number of studies required to explore the parameter space. The result is a demanding workflow with flexibility and efficiency being required at all stages. In order for the approach to be cost effective, investigations are often simplified at various stages. For example, the use of interactive inversion [Boschetti 2001] can greatly improve the investigation process by “optimising” the parameters towards models that are “better” under the guidance of the geoscientist. This approach effectively eliminates large sections of the parameter space and substantially reduces computational cost. Template-based numerical modelling [Potma 2004] can be used to automatically create families of geometrically related models based on a single manually created model. This automation of the workflow significantly reduces the time required to create the models and improves overall productivity. However, there are significant inefficiencies in the workflow that have not been eliminated. These inefficiencies occur with interactions between people, organisations and resources:

• Information scattered across multiple geological surveys and the mining companies hampers the gathering of input data. Consequently, the cost of data integration can substantially exceed all other costs. Investigators often ignore this wealth of real world observation data preferring to use “average” properties that could range by several orders of magnitude between geographic locations.

• Multiple HPC resources are available, particularly for research purposes, via the Australian Partnership for Advanced Computing (APAC). However, access is often difficult due to differing queuing and data staging policies at each site. Investigators often find the cost of adapting their tools to use multiple sites prohibitive and limit their access to resources either to their own PCs or a single HPC facility.

We are involved in several projects that, when linked, substantially address the issues of scattered information and access to multiple HPC resources in the geosciences. Further we apply these technologies to the pmd*CRC modelling of mineralization processes for the purposes of improved exploration targeting. These issues are seen across multiple domains, not just the geosciences. As a result, open geospatial standards (http://www.opengeospatial.org), service-oriented architectures (http://www.ibm.com/developerworks/webservices) and grid computing [Foster 2001] have been developed as domain-independent technologies to assist in dealing with these issues. First we will describe how open geospatial standards have been used in the Solid Earth and Environment Grid to provide access to pre-competitive geoscience data. Then we will describe our activities in the development of the APAC Grid. Finally, we will illustrate how the two previous projects are used in pmd*CRC software architecture.

The Solid Earth & Environment Grid Roadshow 2005

The Solid Earth & Environment Grid (SEE Grid) Roadshow project’s aim was to provide on-demand, web service based access to geoscience information holdings at State and Territory Geological Surveys using a common service interface and information model. This was to be done without alteration to the backend database technologies and private schemas of the surveys. Participants in the project included the CSIRO, Geoscience Australia, Social Change Online, and all State and Territory Geological Surveys, with support from AusIndustry and the Minerals Council of Australia. The implementation made use of the Open Geospatial Consortium Web Feature Service (WFS) for the common service interface [Vretanos 2004] and the CSIRO’s Exploration and Mining Mark-up Language (XMML) for the information model (https://www.seegrid.csiro.au/twiki/bin/view/Xmml/WebHome). The WFS middleware was implemented as an extension to the open source Geoserver WFS (http://geoserver.sourceforge.net/html/index.php). The system was deployed at each state and territory survey with only minor modifications being required to configure the mapping from private data sources (restricted to geochemical assay data for this project) to the community information model. Three client applications, a web based GIS, a web based text report, and a commercial mining software package, were also modified to use the WFS services. At the conclusion of the project all 8 geological surveys had successfully implemented and deployed the WFS service. The client applications could then trivially interrogate all the surveys for geochemical assay information regardless of state and territory boundaries and local information models. This open standards approach was demonstrated in all state and territory capital cities to industry and government stakeholders. The overwhelming response was that the demonstration clearly showed that the issue of data integration and access can be substantially solved through the SEE Grid approach. The CSIRO and Geological Surveys are now developing production grade services for other geological data types via the SEE Grid community of practice. The full availability and improved access to this geoscience data will greatly benefit both industry and research.
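
For readers unfamiliar with WFS, a GetFeature request against such a service is just a parameterised HTTP call. The sketch below uses the standard OGC WFS 1.1.0 key-value parameters; the endpoint URL and the feature type name are hypothetical stand-ins for a deployed survey service and the community schema.

```python
import requests

# Hypothetical survey endpoint and feature type; because every deployed node
# exposed geochemical assay features through the same community schema, the
# same client code works against every survey.
WFS_URL = "http://example-survey.gov.au/geoserver/wfs"

params = {
    "service": "WFS",
    "version": "1.1.0",
    "request": "GetFeature",
    "typeName": "xmml:GeochemistryAssay",   # assumed community feature type
    "maxFeatures": 50,
}

response = requests.get(WFS_URL, params=params, timeout=60)
response.raise_for_status()

# The payload is GML conforming to the community information model (XMML);
# here we simply report its size rather than parse it.
print(f"Received {len(response.content)} bytes of GML")
```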

APAC Grid: Geosciences

The Australian Partnership for Advanced Computing, which has partners in most states and territories, provides HPC and mass storage facilities for research purposes. These facilities are managed independently and have a range of computing hardware and software systems which require users to adapt their workflow to a specific facility. In order to better facilitate access, APAC has been implementing Grid technology to provide standardised access to domain-independent services like job management, data storage, monitoring, and security. This standardisation goes some way towards improving access to the resources at the facilities regardless of the underlying software and hardware infrastructure. In addition to this, APAC is supporting the development of discipline-specific services in areas like bioinformatics, astronomy, high energy physics and the geosciences. The APAC Grid Geosciences project is developing service interfaces for commonly used algorithms and making them available at multiple locations using the standard interface. The geoscience specific services are built on top of the domain-independent grid technology and, as a result, development of the geoscience specific services is greatly accelerated. Researchers can make use of these services by chaining them together to support their workflow. As a demonstration of the approach several applications are being developed. A seismic simulator web portal and computational service has recently been completed. A mantle convection service based on the Snark numerical code is under development. Ultimately the goal is to service chain the mantle convection service with an independently developed interactive inversion service. Finally, the EarthByte 4D data portal will provide a service for simulation and visualisation of geological and geophysical observations coded by tectonic plate and geological time. The EarthByte data portal will retrieve the observational data from the geological surveys using the SEE Grid approach. In addition, the results of the EarthByte simulation can be chained into the mantle convection service to provide initial and boundary conditions for modelling runs. This service chaining illustrates how the use of service based architectures can support the creation of more complex systems with substantially reduced development effort. The project is also developing a “how-to” guide to assist other groups who wish to make their programs available as grid enabled services. The standardisation of interfaces across the APAC Grid will allow a greater number of researchers to access more resources. In addition, the publication of discipline-specific services that can be chained together will facilitate better access to research outcomes and further collaboration.

pmd*CRC Modelling Toolkit

The pmd*CRC modelling workflow was briefly described in the introduction to this article. It is a demanding workflow, often requiring integration of disparate data, substantial computation and the flexibility to substitute numerical algorithms better suited to the investigation at hand. The developments being undertaken in the SEE and APAC Grid enable the cost effective development of a modelling toolkit to support this workflow. An example of possible service interactions is shown in Figure 1. This would be a challenging and costly development if all components needed to be developed and deployed by a single organisation. However, the SEE and APAC Grid communities are developing many of these components already. The main focus of the pmd*CRC development is on the workflow and the orchestration of the service interactions.

Figure 1: An example of the service interactions involved in the pmd*CRC modelling toolkit. With SEE Grid and APAC Grid fully deployed, only the workflow component requires development; all other services pre-exist. Services can be substituted, e.g. Snark replaced with a more suitable numerical code for a given problem.

Discussion

The formation of communities of practice, like the SEE and APAC Grid, which publish useful services using open standards, provides new opportunities for the publication of and access to geoscience data and programs. We have shown that such services can be developed, deployed and orchestrated in order to support advanced geoscience modelling workflows.

References

F. Boschetti and L. Moresi, Interactive Inversion in Geosciences, Geophysics, 64, 1226-1235, 2001.
I. Foster, C. Kesselman and S. Tuecke, The Anatomy of the Grid, Intl. J. Supercomputer Applications, 2001.
W. Potma, P. Schaubs and T. Poulet, Application of template-based generic numerical modelling to exploration at all scale, pmd*CRC Conference, Barossa Valley, 2004.
P. Vretanos, Web Feature Service 1.1, OGC document 04-094, 2004.



Temporal explosion: the need for new approaches in interpreting and managing geochronology data

K. Sircombe(1)

(1) Minerals Division, Geoscience Australia, PO Box 378, Canberra, ACT 2601

Abstract

Recent analytical improvements have provided a new wave of geochronology data, but developments in understanding of the analytical process and the application of statistically robust interpretative/visualisation tools have lagged. Some tools are emerging, but there is a strong need for further development. A fundamental concern is simply how to manage the volumes of data now being acquired. Work has begun on developing a standard for the exchange of geochronology data using XML technology within the framework of other standard developments in Australian geosciences. However, a caveat in both the development of tools and standards is that they must be usable and must be accompanied by sufficient education to enable regular use.

Keywords: geochronology, statistics, analytical tools, data management.

Introduction

Geology is a four-dimensional science: When did this volcano last erupt? What is the rate of crustal uplift in this area? Are the deformation and mineralising events at gold prospect A the same age as at gold mine B? Does the age of these dune fields fit the known climate record?

Geochronology – the sub-discipline that measures the age of earth materials – provides the temporal framework in which other geoscience data can be interpreted in an evolutionary context. The integration of data across geoscience sub-disciplines is an increasingly important feature of geological research, thus it is important that the strategies and tools are available to make the most effective use of these data.

This paper will focus on the radiometric geochronology methods that provide absolute ages of earth materials using the radioactive decay of isotopes. Traditionally, radiometric geochronology involves a laboratory-intensive procedure known as Thermal Ionisation Mass Spectrometry (TIMS) that requires the meticulous dissolution, separation and measurement of individual elements and/or isotopes from suitable mineral phases such as zircon, monazite, and biotite. A single analysis can take days from start to finish. The specialist facilities required and laborious process naturally limited the amounts of geochronological data that could be produced.

Like much of geoscience, geochronology has experienced a rapid evolution in analytical capability in the last decade and also faces a data explosion. The development of methods such as Secondary Ionisation Mass Spectrometry (SIMS) and Laser Ablation Inductively Coupled Plasma Mass Spectrometry (LA-ICPMS) that can measure isotopes from areas within individual mineral grains within minutes has created this wave of data. Many facilities have become geochronology production lines that have provided a wealth of new information, but the tools for data management and interpretation – and even the analytical approach – have not evolved as rapidly.

Analysis

A geochronology workflow will typically analyse several to hundreds of individual mineral grains depending on the aims of the project, the methods being applied and the type of equipment used. An individual analysis will measure a variety of isotopic ratios (e.g. lead and uranium) from which an age can be calculated. Depending on the method there can be considerable mathematical processing of the ‘raw’ data via a variety of regressions and statistical tests. The calculated age is typically in millions of years and is reported with an associated uncertainty based on uncertainties propagated from analytical counting statistics and calibrations. TIMS analyses typically have much greater precision than SIMS or LA-ICPMS analyses. A determined age for a sample (e.g. the age of a volcanic rock) is based on a statistical assessment of the analyses, such as a weighted mean or a linear regression, and is reported as age ± uncertainty at 95% confidence. An illustration of this is provided in Figure 1.
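
The weighted mean and 95% confidence interval quoted in such plots can be computed along the following lines. This is a simplified sketch with hypothetical ages; production data-reduction packages also propagate calibration uncertainties and apply outlier rejection.

```python
import numpy as np

def weighted_mean_age(ages_ma, sigmas_ma):
    """Inverse-variance weighted mean of single-grain ages (Ma) with 1-sigma
    analytical uncertainties; returns the mean, its 95% confidence interval
    and the MSWD (reduced chi-square) as a coherence check."""
    ages = np.asarray(ages_ma, dtype=float)
    sig = np.asarray(sigmas_ma, dtype=float)
    w = 1.0 / sig ** 2
    mean = np.sum(w * ages) / np.sum(w)
    mean_sigma = np.sqrt(1.0 / np.sum(w))
    mswd = np.sum(w * (ages - mean) ** 2) / (ages.size - 1)
    return mean, 1.96 * mean_sigma, mswd

if __name__ == "__main__":
    # Hypothetical SIMS analyses of a single zircon population.
    ages = [1735.0, 1741.0, 1729.0, 1738.0, 1744.0, 1732.0]
    sigmas = [8.0, 9.0, 7.5, 8.5, 9.0, 8.0]
    mean, ci95, mswd = weighted_mean_age(ages, sigmas)
    print(f"Weighted mean age: {mean:.1f} +/- {ci95:.1f} Ma (95% conf.), MSWD = {mswd:.2f}")
```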

Figure 1. A typical illustration of geochronological data - in this case uranium-lead isotopic data acquired using a SIMS instrument. The main diagram illustrates the interaction of two isotopic systems while the inset diagram provides a univariate probability density distribution of the ages derived from the plotted data. The determined age of the sample, in this case a weighted mean of all the analyses except those excluded for analytical and geological reasons (solid squares), is provided in the top right. (From Sircombe, 2003).

Geochronological analyses of single mineral grains can follow two strategies (Fedo et al., 2003). The first, qualitative analysis, involves the analyst specifically selecting grains on the basis of colour, morphology, internal compositional zoning or other properties. The purpose is to identify and date all components that make up the sample; for instance, a small proportion of grains with convolute internal zoning may be a different age to the majority of grains with internal oscillatory zoning. This strategy is principally employed to date material where a single age or a simple collection of ages is expected, such as in igneous or some metamorphic rocks.

The second strategy is quantitative analysis which involves analysing a random selection of grains from the sample. The aim is to sample an accurate representation of the total population. This strategy is principally employed to date detrital material from sedimentary rocks where a mixture of ages can be expected and the proportions of various components themselves provide information.

In quantitative analyses statistical concerns become paramount. How many analyses are enough to be sure that the ages are an accurate representation of the total population? This question was originally approached by Dodson et al. (1988), who calculated a “magic” number of ~60 as producing a sample where there was a less than 5% probability that a component comprising 1 in 20 of the total would be missed. These calculations have recently been revived by Vermeesch (2004) and Andersen (2005) to produce much larger values for statistical adequacy (typically 100+). However, a truly rigorous adequate sample size may need to be judged on the heterogeneity of the population, and further work is required.
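
The ~60-grain figure follows from a simple sampling argument: if a component makes up a fraction f of the population, the probability that n randomly selected grains all miss it is (1 - f)^n. The few lines below reproduce the numbers quoted above as an illustration of the reasoning, not as the exact calculation of Dodson et al.

```python
import math

def grains_needed(fraction, miss_probability=0.05):
    """Smallest n such that a component forming `fraction` of the population
    is missed entirely with probability below `miss_probability`, assuming
    simple random sampling: (1 - fraction)**n < miss_probability."""
    return math.ceil(math.log(miss_probability) / math.log(1.0 - fraction))

print(grains_needed(1 / 20))        # ~59 grains, close to the "magic" 60
print((1 - 1 / 20) ** 60)           # chance of missing a 1-in-20 component
                                    # with 60 grains: about 0.046 (< 5%)
print(grains_needed(1 / 20, 0.01))  # a stricter 1% criterion already needs ~90;
                                    # the multinomial arguments of Vermeesch (2004)
                                    # and Andersen (2005) push the figure higher
```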

Often the purpose of analysing sedimentary rocks using a quantitative approach is to gain an idea of the age of deposition. To this end, the youngest age in a suite of analysed grains is often treated as a proxy for the maximum age of deposition for that sedimentary rock. These types of analyses have become vital in broad homogeneous sedimentary sequences where there is little other suitable material to date. However, these data sets frequently provide mathematical conundrums on how to define the youngest age in the data. Is it the solitary analysis that in a strict statistical sense is an outlier? Is it the weighted mean of the youngest n grains that can provide a statistically valid grouping? Can it be calculated via deconvolution methods (e.g. Sambridge and Compston, 1994)? Again, further work is required to develop a statistically rigorous and practically acceptable method.

Interpretation and visualisation

In geologically simple samples, such as a single phase igneous rock with no isotopic inheritance from older rocks, interpretation and comparison of geochronology data can be equally simple. Determined ages for samples are calculated from the analyses and can be compared using t-tests. However, the large volumes of data produced by quantitative style analyses often pose interpretative problems, particularly if they are complex, heterogeneous samples. The problem can be further compounded when attempting to compare results produced by different methods that can have widely ranging individual analytical precisions and methods for illustrating the data. The description below will focus on the display of univariate age data in probability density distributions, although there are also recent efforts to develop tools that enable visualisation to provide more ‘multi-dimensional’ information from the original data.

The traditional approach has been to simply eye-ball plots of the data to ‘see’ if there were common components or patterns in the age distributions between samples (e.g. the left-hand column of Figure 2). While this is a practical first-pass approach, it is obviously subjective and quickly becomes impractical beyond a few samples (some studies such as regional synthesis projects can potentially have hundreds of samples to compare). Attempts have been made to develop statistical methods for comparing and contrasting these distributions (Sircombe, 2000; Berry et al., 2001) and the most recent approach is to use kernel functional estimation (Sircombe and Hazelton, 2004).
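
The kernel idea can be illustrated with SciPy's Gaussian kernel density estimator. Note that the kernel functional estimation of Sircombe and Hazelton (2004) additionally incorporates the analytical uncertainty of each analysis, which this simplified sketch (with made-up age spectra) does not.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical detrital zircon age spectra (Ma) for two sedimentary samples.
sample_a = np.concatenate([rng.normal(500, 20, 40), rng.normal(1100, 40, 60)])
sample_b = np.concatenate([rng.normal(520, 25, 30), rng.normal(1700, 50, 70)])

ages = np.linspace(0, 2500, 1000)
density_a = gaussian_kde(sample_a)(ages)
density_b = gaussian_kde(sample_b)(ages)

# A simple (dis)similarity measure between the two estimated distributions:
# half the integrated absolute difference (0 = identical, 1 = no overlap).
mismatch = 0.5 * np.trapz(np.abs(density_a - density_b), ages)
print(f"Distribution mismatch between samples A and B: {mismatch:.2f}")
```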

These developments are still in their emergent phase and considerable testing and refinement of mathematical techniques are required. However, a crucial element of any such development is not mathematical, but psychological. Many geological practitioners are uncomfortable with mathematics and are particularly discouraged by complex processes with cumbersome interfaces. In such an environment an otherwise perfectly robust and valuable process can simply sink from the collective scientific consciousness if it, or rather its authors, do not provide sufficient usability and user education.


Figure 2. Illustration of statistical comparison techniques being applied in the interpretation of large sets of geochronological data. This case involves 25 sets of quantitative data from sedimentary rocks across central, southeast Australia and Tasmania. A more traditional approach would have been simply to eye-ball the probability density distributions illustrated in the left-hand column. (From Sircombe and Hazelton, 2004).

Data management

The explosion of geochronology data has created data management issues because it has highlighted the lack of adequate standards in managing and exchanging data at all phases from acquisition to end-user. These issues are not unique to geochronology, but form part of the greater challenge of developing the technology, user processes and culture that will enable the full potential of interoperable scientific research networks to be realised. The impetus for the development of these networks within Australian geoscience is enormous as many of the major questions now facing academic, government and industry geoscientists can only be answered via the efficient and effective integration of data across many disciplines. The development of these systems has been highlighted as being of national importance (Department of Industry, Tourism and Resources, 2004; Department of Education, Science and Training, 2005).

As a contribution to the broader efforts, Geoscience Australia has initiated a project to develop a standard data format for the exchange of geochronology data based on XML (eXtensible Markup Language) technology and strongly linked to other related developments in geosciences (e.g. XMML, eXploration & Mining Markup Language, https://www.seegrid.csiro.au/twiki/bin/view/Xmml/WebHome) and similar international efforts related to the management and usage of spatial information. This project is currently in the early phases of development and is concentrated on gathering user requirements in order to begin the development of data models. Feedback from geochronology specialists to basic users is sought in order to ensure that the user requirements are robust and widely applicable. Given the wide variety of geochronology methods and usages available, developing the data models to express the relationships among the required data fields will be a complex task, and any contributors will be welcome.

The ultimate aim of the project is to ensure that users of geochronological data within Australia can quickly find the data they require via an Internet portal and download it in such a format that they – or their application – can readily translate the information and make effective use of it. Again, there is a strong need to note that simply providing the technology is often not enough. Systems must have a high usability and be accompanied by sufficient education and motivation to encourage regular use beyond a small huddle of specialists.

References

Andersen, T. Detrital zircons as tracers of sedimentary provenance: limiting conditions from statistics and numerical simulation. Chemical Geology, 216: 249–270, 2005.

Berry, R.F., Jenner, G.A., Meffre, S., Tubrett, M.N. A North American provenance for Neoproterozoic to Cambrian sandstones in Tasmania. Earth and Planetary Science Letters, 192: 207–222, 2001.

Department of Education, Science and Training. An e-Research Strategic Framework: A Discussion Paper. 2005. http://www.dest.gov.au/NR/rdonlyres/F89601F7-6E10-4A2A-9E0A-E04C4AD17BB7/5864/20050602finaldiscussionpaper.pdf

Department of Industry, Tourism and Resources. The Road to Discovery: Minerals Exploration Action Agenda. 2004. http://www.industry.gov.au/assets/documents/itrinternet/Road_to_Discovery20040702155050.pdf

Dodson, M.H., Compston, W., Williams, I.S. & Wilson, J.F. A search for ancient detrital zircons in Zimbabwean sediments. Journal of the Geological Society of London, 145: 977–983, 1988.

Fedo, C.M., Sircombe, K.N. & Rainbird, R.H. Detrital zircon analysis of the sedimentary record. In Hanchar, J.M. & Hoskin, P.W.O. (Eds.), Zircon: Reviews in Mineralogy and Geochemistry, 53: 277–303, 2003.

Sambridge, M.S. & Compston, W. Mixture modelling of multi-component data sets with application to ion-probe zircon ages. Earth and Planetary Science Letters, 128: 373–390, 1994.

Sircombe, K.N. & Hazelton, M.L. Comparison of detrital zircon age distributions by kernel functional estimation. Sedimentary Geology, 171: 91–111, 2004.

Sircombe, K.N. Age of the Mt. Boggola volcanic succession and further geochronological constraint on the Ashburton Basin, Western Australia. Australian Journal of Earth Sciences, 50: 967–974, 2003.

Sircombe K.N. Quantitative comparison of large sets of geochronological data using multivariate analysis: a provenance study example from Australia. Geochimica et Cosmochimica Acta, 64: 1593–1616, 2000.

Vermeesch P. How many grains are needed for a provenance study? Earth and Planetary Science Letters, 224: 441–451, 2004.


GPlates and GPML: Open software and standards for linking data to geodynamic models on the APAC grid

R. Dietmar Muller(1)

(1) School of Geosciences and University of Sydney Institute of Marine Science (USIMS), Edgeworth David Building F05, Eastern Ave., Main Campus, The University of Sydney NSW 2006, Australia

Abstract

Unravelling the evolution of planet Earth, as well as resource exploration, depends on our ability to link many different types of observations and models to each other in a plate kinematic context. However, no common tool is available to track the time history of geological and geophysical data broken up into plates, or to simultaneously display models for mantle dynamics in a plate tectonic framework. To overcome this obstacle to synthesizing and modelling Earth processes, we are developing a plate kinematic/geodynamic information model, the GPlates Markup Language (GPML), and GPlates software to create a universal standard for plate reconstructions, linked both to commonly used databases and to geodynamic models. GPlates/GPML combines well designed tools for data integration and visualization with a powerful mathematical backend that allows researchers to easily acquire, investigate, manipulate and distribute plate tectonic data and link them to geodynamic models. We focus on two examples of linking kinematic data to models: (1) modelling the current and paleostress field of the Australian continent via automatic optimization using Abaqus and Nimrod, based on a combination of continental and oceanic geophysical and geological data, and (2) combining a relative and absolute plate motion model with a regional plate tectonic database to restore the geometry and velocities of plates through time for linking to a 3D mantle convection model.

Keywords: plate tectonics; information model; continental paleostress; mantle convection


Towards a Geoscience Information Commons: the Electronic Geophysical Year, 2007-2008 and the Global Earth Observing System of Systems

C. Barton(1), Alex Held (2)

(1) Research School of Earth Sciences, Australian National University, Canberra, ACT 0200 ([email protected]) (2) CSIRO Office of Space Science and Applications, GPO Box 3023, Canberra, ACT 2601 ([email protected])

Abstract

The bad news is that we are stressing the natural resources of our planet and we need comprehensive, multidisciplinary data and information about the Earth and its space environment for sustainable management and hazard mitigation.

The good news is that modern information and communications technologies (interoperability), combined with comprehensive Earth observation programs and a spirit of cooperation among governments, give us an unprecedented ability to collect and share data and information. Two initiatives that are contributing towards the ideal of a 'Geoscience Information Commons' are eGY and GEOSS.

The Electronic Geophysical Year, 2007-2008 (eGY) is an initiative of the International Union of Geodesy and Geophysics that marks the 50th anniversary of the International Geophysical Year. eGY provides an international cooperative environment and mandate for addressing issues of ready and open access to data, information, and services, including data discovery, data release, data preservation, data rescue, reducing the digital divide, and education and public outreach. These issues are the formal themes of eGY, and are embodied in the principles set out in the eGY 'Declaration for a Geoscience Information Commons'. Promoting the development of virtual observatories is a central feature of eGY.

Through a series of three Earth Observation Summits, representatives of some 60 national governments have committed to establishing a Global Earth Observation System of Systems. GEOSS will build on existing and planned Earth observation programs and data systems. Its main objectives are (i) to identify and fill gaps in our observing capability, and (ii) to achieve ready, open, and timely access to shared data and information.

Both initiatives provide opportunities for geoscientists, but they also challenge us to face the practicalities of managing and sharing large, multidisciplinary data sets so as to facilitate open, convenient, and timely access.

Keywords: eGY; GEOSS; interoperability; Information Commons
Websites: www.egy.org


On Software Infrastructure for Computational Earth Sciences

L. Gross, J. Smillie, E. Thorne

Earth Systems Science Computational Centre (ESSCC), The University of Queensland, Brisbane, Australia

Abstract

Simulation software is built with three layers: the user interface, the mathematical models and the numerical methods. The mathematical layer provides abstraction from the numerical techniques and their implementation, while the user interface provides abstraction from the mathematical models. Each layer has its own terminology, requires special skills from the user working within it and consequently needs particular computational tools. Appropriate tools are C/C++ for implementing numerical techniques, a scripting language such as python for programming mathematical models, and input files and graphical user interfaces for the user interface layer. In this talk we present the concepts of the mathematical modelling language escript and show how escript is linked downwards into numerical libraries and upwards into user interfaces.

Keywords: Software Infrastructure; Mathematical Modelling; Finite Element Method

Introduction

Conceptually, the three layers "user interface", "mathematical models" and "numerical algorithms" found in numerical simulation codes are independent. For instance, the information needed to describe a salinity scenario in a specific region is independent from the mathematical calculus used to model salinity. The model, on the other hand, should be general enough to treat the relevant cases and is independent from the finite elements (FEM), finite differences or finite volumes used to discretise the relevant partial differential equations. Moreover, the model is independent from the actual compute platform and the actual implementation of the discretization method, which may even be platform dependent. Numerical algorithms are computationally intensive and, to achieve sufficient code efficiency and scalability, have to be implemented in C/C++ to work close to the underlying hardware. However, for the modeller working within the mathematical layer, handling model complexity (coupling, time-dependency, non-linearity) and the flexibility to easily modify and test models, rather than code efficiency, are the major concerns. An object oriented scripting language, such as python [3], is appropriate for this layer. The end user of a verified model does not want to see the mathematical formulation involved in the model, but wants to apply the model in his or her particular context in order to analyse and predict the behaviour of the system. A file, typically in XML format, is the appropriate way of providing a description of the problem. The file may be created through a graphical user interface or a web service.

Each of the layers has its individual terminology: the numerics layer uses terms like floating point numbers and data structures; the mathematical layer talks about functions and partial differential equations; the user interface uses terms like stress, temperature and viscosity. Moving from one layer to another requires a translation: data like viscosity and temperature become coefficients in partial differential equations, and coefficients become arrays of floating point numbers distributed across the processors of a parallel machine.

Various tools have been developed to create graphical user interfaces and web services, some of them using graphical user interfaces themselves. A lot of work has also been done on tools supporting the efficient and portable implementation of numerical methods. However, very little work has been spent on tools for the implementation of mathematical models. In this paper we present the basic ideas of the modelling environment escript [2] and how escript is linked with the numerical layer as well as the user interface layer. We focus on the context of partial differential equations (PDEs).

Modelling Environment

In order to make use of existing technologies, escript is an extension of the interactive scripting environment python. It introduces two new classes, namely the Data class and the LinearPDE class. Objects of the Data class define quantities with a spatial distribution which are represented through their values on sample points. Examples are a temperature distribution given through its values at the nodes and a stress tensor at the quadrature points in the elements of a finite element mesh. In escript, scalar, vector and tensorial quantities up to order 4 are supported. Objects can be manipulated by applying unary operations (for instance cos, sin, log) and be combined by applying binary operations (for instance +, −, *, /). A Data object is linked with a certain interpretation provided by the numerical library in whose context the object is used.
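To make the Data class more concrete, the following minimal sketch manipulates Data objects in the way just described. It is only an illustration: the module paths, the Rectangle domain generator and the constructor signatures are assumptions and may differ between escript versions.

# A minimal sketch of manipulating escript Data objects, based on the classes
# named in the text; module paths and constructor signatures are assumptions.
from esys.escript import Function, Scalar, sin, Lsup
from esys.finley import Rectangle   # an assumed finley mesh generator

dom = Rectangle(n0=10, n1=10)              # a simple rectangular FEM domain
x = dom.getX()                             # node coordinates as a Data object
T = 300. + 50. * sin(3.14159 * x[0])       # unary operation and arithmetic on Data
eta = Scalar(1.e3, Function(dom))          # a scalar held at element quadrature points
ratio = T / eta                            # binary operation; escript interpolates
                                           # between the two representations if needed
print(Lsup(ratio))                         # supremum norm of the resulting Data object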


If needed, escript invokes interpolation during data manipulation. Typically, this occurs in binary operations when the arguments are defined in different contexts, or when data are passed to a numerical library which requires the data to be represented in a particular way, such as a FEM solver that requires the PDE coefficients at the quadrature nodes. A LinearPDE object is used to define a general linear, steady, second order PDE for an unknown function u on the domain Ω. In tensor notation, the PDE has the form

-(A_{ijkl} u_{k,l} + B_{ijk} u_k)_{,j} + C_{ikl} u_{k,l} + D_{ik} u_k = -X_{ij,j} + Y_i     (1)

where u_k denotes the components of the function u and u_{,j} denotes the derivative of u with respect to the j-th spatial direction. A general form of natural boundary conditions and constraints can be considered. The functions A, B, C, D, X and Y are the coefficients of the PDE and are typically defined by Data objects. When a solution of the PDE is requested, escript passes the PDE to the solver library, which returns a Data object representing the solution by its values, for instance, at the nodes of a FEM mesh. Currently escript is linked with the FEM solver library finley [1], but other libraries and even other discretization approaches can be included. The following python function incompressibleFluid implements a simplified form of the penalty iteration scheme for a viscous, incompressible fluid. It takes the PDE domain dom, the viscosity eta and the internal force F as arguments:

# Import paths assumed for illustration; they may vary between escript versions.
from esys.escript import Tensor4, Scalar, ContinuousFunction, Function, Lsup, kronecker, div
from esys.escript.linearPDEs import LinearPDE

def incompressibleFluid(dom, eta, F, Pe=1.e4, tol=1.e-8):
    # The penalty parameter Pe and the tolerance tol were implicit in the
    # original listing; here they are exposed as keyword arguments.
    E = Tensor4(0, ContinuousFunction(dom))
    for i in range(dom.getDim()):
        for j in range(dom.getDim()):
            E[i, i, j, j] += Pe
            E[i, j, i, j] += eta
            E[i, j, j, i] += eta
    mypde = LinearPDE(dom)
    mypde.setValue(A=E, Y=F)
    p = Scalar(0, Function(dom))
    vkk = Scalar(2*tol, Function(dom))    # ensures the loop body runs at least once
    while Lsup(vkk) > tol:
        mypde.setValue(X=kronecker(dom)*p)
        v = mypde.getSolution()
        vkk = div(v)                      # divergence of the velocity field
        p -= Pe*vkk                       # penalty update of the pressure
    return v, p

The statement div(v) returns the divergence v_{k,k} of v. The function returns the velocity v and the pressure p. The tensor E and the pressure p are introduced with different function space attributes, ContinuousFunction() and Function(), defining different degrees of "smoothness". This mathematical concept of smoothness is implemented through different representations of values: in the case of FEM, the tensor E would typically be held at the nodes of the FEM mesh, while the pressure is stored at the quadrature points. The solver library and the discretization method used to solve the PDE are defined by the domain dom.
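For illustration, a driver for the function above might look as follows. The domain size, viscosity and body force are invented values, and the construction of the force term reuses the component-update idiom of the listing above; the exact API may differ between escript versions.

# A hypothetical driver for incompressibleFluid; all values are illustrative.
from esys.escript import Vector, Function
from esys.finley import Rectangle        # assumed domain generator

dom = Rectangle(n0=20, n1=20)            # a small 2D test domain
eta = 1.e2                               # constant viscosity
F = Vector(0., Function(dom))            # zero body force ...
F[1] -= 9.81                             # ... with a downward component
v, p = incompressibleFluid(dom, eta, F)  # velocity and pressure fields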

Model Interfaces

The LinearPDE class provides the interface from escript downwards into the numerical algorithm layer. To build user interfaces, models are wrapped by python classes which are subclasses of the escript Model class. The main feature of a Model class object is the ability to execute a time step for a given suitable step size, which is chosen as the minimum step size over all models involved in the simulation. Moreover, model parameters such as the viscosity eta and the external force F in the example of the incompressible fluid are "highlighted": they can be linked with parameters of other models and can be exposed in an XML input file so that values can be assigned to them, for instance through a graphical user interface. If the class IncompressibleFlow implements a model of an incompressible fluid and MaterialTable is a Model class for a simple material table providing values for a temperature-dependent viscosity, one uses

flow = IncompressibleFlow()
mat = MaterialTable()
mat.temperature = 1000
flow.eta = Link(mat, "viscosity")


to link instances of the two classes. At any time during the simulation, IncompressibleFlow will use the value provided by the MaterialTable object at that moment. The capability of escript to know about the context of data and to invoke data conversion when required is vital to make this very simple way of using models actually work. This script can be represented as an XML file which can be edited, for instance to change the value of the temperature, and then be used to recreate the script for the new configuration. In the case of a mantle convection simulation we would like to introduce a temperature-dependent viscosity. If the Temperature class provides an implementation of a temperature advection-diffusion model, the following statements link this model with the incompressible flow model:

temp = Temperature()
temp.velocity = Link(flow, "v")
mat.temperature = Link(temp, "T")

We assume here that v is the velocity provided by the flow model and T is the temperature of the temperature model. Instead of a python script, the link between the models can be established through an XML description. The order in which the models perform their time steps is critical. The Simulation class, which in this example is used in the form

Simulation([flow,mat,temp]).run()

will make sure that the incompressible flow model updates its velocity before the temperature model performs the next time step. The viscosity is calculated from the temperature of the previous time step. The Simulation can be serialized into an XML file, and the simulation can be started directly from that file. This opens the door to turning models into services in a grid environment. In the presented modelling environment, appropriate interfaces can be built automatically; suitable tools for building graphical user interfaces and web services automatically from the XML simulation file are currently under construction.
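The Model class is only described abstractly above. The sketch below indicates what a minimal MaterialTable model might look like; the base-class import path, the declareParameter helper and the method names getSafeTimeStepSize and doStep are assumptions about the Model interface, and the viscosity law is a placeholder, not a physical model.

# A sketch of a Model subclass along the lines described above; the base-class
# module path and the method names are assumptions and may not match the actual
# escript Model interface.
from math import exp
from esys.escript.modelframe import Model   # assumed import path

class MaterialTable(Model):
    """A toy material table exposing a temperature-dependent viscosity."""

    def __init__(self, **kwargs):
        super(MaterialTable, self).__init__(**kwargs)
        # Parameters declared here can be linked to other models or set from XML.
        self.declareParameter(temperature=273., viscosity=1.e3)

    def getSafeTimeStepSize(self, dt):
        return 1.e99        # the table imposes no restriction on the step size

    def doStep(self, dt):
        # An illustrative Arrhenius-style viscosity law, values are placeholders.
        self.viscosity = 1.e3 * exp(1.e3 / self.temperature - 1.e3 / 273.)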

Acknowledgements

This work is supported by the Australian Commonwealth Government through the Australian Computational Earth Systems Simulator Major National Research Facility, the Queensland State Government Smart State Research Facility Fund, The University of Queensland and SGI.

References

[1] Davies, M., Gross, L. and Muhlhaus, H.-B.: Scripting high performance Earth systems simulations on the SGI Altix 3700. Proceedings of the 7th International Conference on High Performance Computing and Grid in the Asia Pacific Region, 2004.

[2] Gross, L., Cochrane, P., Davies, M., Muhlhaus, H. and Smillie, J.: Escript: numerical modelling in python. Proceedings of the Third APAC Conference on Advanced Computing, Grid Applications and e-Research (APAC05), 2005.

[3] http://www.python.org [October 2005].


How to avoid collateral damage: Principles for linking data users to data providers

M. Feeney(1), J. Busby(2)

(1, 2) Australian Government - Office of Spatial Data Management, GPO Box 378, Canberra ACT 2601

Abstract

The data explosion places a premium on good practice in data management. Priority data needs to be specified; managed by identified and responsible custodians; comprehensively documented and made discoverable through metadata; compliant with relevant standards; accessible; and provided with minimal constraints. Without agreed and effective processes for managing data in networked environments, duplication, loss of knowledge of data quality, difficulties in assessing fitness for purpose, confusion over authoritative data sources and other inefficiencies will strangle large-scale, multi-agency, cross-disciplinary projects.

The Office of Spatial Data Management (OSDM) has been established by the Australian Government to implement its Spatial Data Access and Pricing Policy. In essence, we focus on lowering the barriers to access and use of data required or held by Australian Government agencies. While scientists within any particular subject domain can usually rely on informal peer networks to assess the quality and relevance of data, and generally experience few problems negotiating access to suitable data, these informal arrangements can break down in cross-disciplinary environments, especially those involving government or private sector data owners. Drawing on experience with collaborative multi-thematic and multi-jurisdictional data networks in Australia and internationally, OSDM will outline the relevance of, and good-practice solutions to, issues such as custodianship, metadata, interoperability and other data standards, data access and licensing.

Keywords: custodianship, metadata, interoperability, standards, licensing.