14
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170 Published online 22 May 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1192 Integrating and mining distributed environmental archives on Grids M. Zhizhin 1, ,† , E. Kihn 2 , R. Redmon 2 , A. Poyda 3 , D. Mishin 1 , D. Medvedev 1 and V. Lyutsarev 4 1 Geophysical Center, Russian Academy of Sciences, Moscow, Russia 2 National Geophysical Data Center, NOAA, Boulder, CO, U.S.A. 3 Moscow State University, Russia 4 Microsoft Research, Cambridge, U.K. SUMMARY The solar-terrestrial physics distributed database for the ICSU World Data Centers, and the NCEP/NCAR climate re-analysis data have been integrated into standard Grid environments using the OGSA-DAI framework. A set of algorithms and software tools for distributed querying and mining environmental archives using the UNIDATA Common Data Model concepts has been developed. In addition, the toolkit enables querying the data using meaningful ‘human linguistic’ terms. Copyright c 2007 John Wiley & Sons, Ltd. Received 24 October 2006; Revised 1 February 2007; Accepted 22 February 2007 KEY WORDS: very large databases; common data model; time series; data mining; fuzzy logic; data Grids; space physics; weather re-analysis; environmental informatics INTRODUCTION Environmental informatics combines the research fields of Artificial Intelligence, Geographical Information Systems (GIS) Modeling and Simulation and User Interfaces and is a rapidly expanding area of computer and natural science [1]. The increasing data volumes from today’s collection systems and the need of the scientific community to include integrated and authoritative representation of the natural environment in analysis requires a new approach to data mining, management and access [2]. The natural environment includes elements from multiple domains such as space, terrestrial weather, Correspondence to: M. Zhizhin, Geophysical Center, Russian Academy of Sciences, 119991 Moscow, Russia E-mail: [email protected] Contract/grant sponsor: Microsoft Research Cambridge Contract/grant sponsor: Russian Foundation for Basic Research grant; contract/grant number: 04-07-90362 Copyright c 2007 John Wiley & Sons, Ltd.

Integrating and mining distributed environmental archives on Grids

Embed Size (px)

Citation preview

Page 1: Integrating and mining distributed environmental archives on Grids

CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCEConcurrency Computat.: Pract. Exper. 2007; 19:2157–2170Published online 22 May 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1192

Integrating and miningdistributed environmentalarchives on Grids

M. Zhizhin1,∗,†, E. Kihn2, R. Redmon2, A. Poyda3,D. Mishin1, D. Medvedev1 and V. Lyutsarev4

1Geophysical Center, Russian Academy of Sciences, Moscow, Russia2National Geophysical Data Center, NOAA, Boulder, CO, U.S.A.3Moscow State University, Russia4Microsoft Research, Cambridge, U.K.

SUMMARY

The solar-terrestrial physics distributed database for the ICSU World Data Centers, and the NCEP/NCARclimate re-analysis data have been integrated into standard Grid environments using the OGSA-DAIframework. A set of algorithms and software tools for distributed querying and mining environmentalarchives using the UNIDATA Common Data Model concepts has been developed. In addition, the toolkitenables querying the data using meaningful ‘human linguistic’ terms. Copyright c© 2007 John Wiley &Sons, Ltd.

Received 24 October 2006; Revised 1 February 2007; Accepted 22 February 2007

KEY WORDS: very large databases; common data model; time series; data mining; fuzzy logic; data Grids; spacephysics; weather re-analysis; environmental informatics

INTRODUCTION

Environmental informatics combines the research fields of Artificial Intelligence, GeographicalInformation Systems (GIS) Modeling and Simulation and User Interfaces and is a rapidly expandingarea of computer and natural science [1]. The increasing data volumes from today’s collection systemsand the need of the scientific community to include integrated and authoritative representation of thenatural environment in analysis requires a new approach to data mining, management and access [2].The natural environment includes elements from multiple domains such as space, terrestrial weather,

∗Correspondence to: M. Zhizhin, Geophysical Center, Russian Academy of Sciences, 119991 Moscow, Russia†E-mail: [email protected]

Contract/grant sponsor: Microsoft Research CambridgeContract/grant sponsor: Russian Foundation for Basic Research grant; contract/grant number: 04-07-90362

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 2: Integrating and mining distributed environmental archives on Grids

2158

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

M. ZHIZHIN ET AL.

oceans and terrain. Systems such as the Global Change Master Directory (GCMD) from NASA [3] orthe Master Environmental Library (MEL) [4] from the DOD Defense Modeling and Simulation Officeand others provide the ability to search metadata by keywords for links to archived environmental datasets distributed across the network, but the ability to search for specific scenarios (sets of conditions)within the environmental data does not yet exist.

At the same time, the environmental modeling community has begun to develop several archivesof continuous environmental representations. These archives contain a complete view of the Earthsystem parameters over a regular Grid for a considerable period of time. The numerical modelsused to reproduce environmental parameters take all available observational data as initial conditions,so the resulting petabyte-size data sets jointly may be considered as authoritative high-resolutionrepresentations of terrestrial weather and the near-Earth space during the last 50 years [5,6].

The first generation of data Grids, including the Space Physics Interactive Data Resource(SPIDR) [7], the Earth Observations component of the European DataGrid project [8], the astronomicalSkyServer [9] and the Earth System Grid [12], on climatology and ocean in the early 2000s haveaddressed the problem of environmental data sharing using Grid middleware, high-performancedatabase clusters and high-speed network connections for different Earth sciences user communitiesand geographical regions. Expanding cross-discipline and international data exchange requires morevirtualization, interoperability and standardization efforts in that field.

The Grid community often considers scientific data as collections of semantically and syntacticallysimilar granules or files. There exist several self-describing data models and (networked) file formatsfor the granules: CDF, NetCDF, HDF, etc. In this sense the data access means a rapid transfer of a largenumber of files from the nearest Grid container. For example, 2-Micron All Sky Survey project resultsin 10 TB of data spread over 5 × 106 files. One can use secure GridFTP protocol [10] and efficientStorage Resource Brokers [11] to handle such mass data transfers. However, this approach still doesnot allow interactive data mining for environmental events within the large number of files. However,from our experience we see that the end users often need access to event-related subsets of data andthey simply have not enough resources to download, store and mine very large datasets locally. In thispaper we present a novel approach to integrate a parallel data mining engine directly into the storagecontainer so that users can conditionally subset the data within container as an optional alternative tothe parallel download of tons of data.

We present the SPIDR [7] data Grid services for World Data Centers [13] which has nodes inthe U.S., Russia, Japan, China, Australia, India, South Africa and France, and propose the next stepof integration based on the OGSA-DAI middleware for management and data mining of distributedterabyte-scale archives for climate, space physics and remote sensing. We use a data source Grid serviceabstraction layer to virtualize sequential databases providing time series for our search engine. The datasource interface is implemented as an OGSA-DAI [14,15] component with a simple output XMLschema. Time series selected from the data source in XML format can be mined directly or transformedinto another format with XSLT [16] to be used by another client, for example by a Microsoft Excelspreadsheet. We show that XML output format with GZIP data compression [17] requires CPUtime and network bandwidth comparable to the NetCDF binary file serialization. Compliance withthe OGSA-DAI specification and use of Java/J# language allowed us to deploy our data source andmining services into most of the existing Web service and Grid service containers including MicrosoftASP.NET [18], Apache Tomcat/Axis [19], WSRF Globus Toolkit 4 [20], OMII [21] and EGEEgLite [22].

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 3: Integrating and mining distributed environmental archives on Grids

2159

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

ENVIRONMENTAL ARCHIVES ON GRIDS

For the parallel mining of a set of conditions inside distributed, very large databases from multipleenvironmental domains we have developed a software toolkit called Environmental Scenario SearchEngine (ESSE). The prime requirement of the ESSE system design is to allow the user to query theenvironmental data archives in human linguistic terms. The mapping between human language andcomputer systems involves fuzzy logic. Imagine, for example, that the end user does not need all ofthe weather data covering the Moscow region for the last 60 years, but rather needs an example of anatmospheric front near Moscow. Furthermore, imagine that this user needs satellite images of the frontand wants to know how often such fronts occur or if they have been increasing in the last 10 years.The ESSE search engine and data mining portal will address such problems.

The rest of the paper has two main parts, data integration and data mining. In the next section wedescribe the existing SPIDR data Grid and present performance tests of the new integration platformfor authoritative environmental data sources based on XML and OGSA-DAI. In the third section wedefine the environmental event scenario, introduce mathematics of the ESSE fuzzy data mining andpresent interactive data mining use case for meteorology and remote sensing. The last section offersconclusions and directions for future work.

INTEGRATION OF ENVIRONMENTAL ARCHIVES ONTO GRID

We define the Grid of environmental data sources as a set of geographically distributed Web servicessharing security model, dynamic service registry and metadata scheme, data request interfaces andthe output data model. This is in line with the general Grid approach towards virtualization of data,services and interfaces [23]. Behind the Web service we can store the environmental data in a filesystem as binary files or images, in a relational database as rows of observations or as another Webservice possibly with a different service contract. Each storage method and structural organizationof the dataset will require a specific implementation of our virtual data source Web service, but forthe Grid user all of them will look like the same Common Data Model structures containing theenvironmental data contents, such as parameters, stations or coordinate Grids and observation timeintervals.

We have been developing this concept of the virtual environmental data source for some time already,starting with distributed Web services and portals for space physics, meteorological and simulationcommunities. A Web-portal serves as an agent between the user and the Grid of environmental datasources. It performs two main functions. The first function is metadata management, which allows forfast and efficient metadata search. Here by metadata we mean general descriptions of data sources,stored as a managed collection of XML documents (owner info, geographic coverage, time coverage,data description, visualization methods, etc.). The metadata can be updated either manually by the useror automatically by a data robot which collects records from the data Grid (see Figure 1). Our metadatacollection works much the same as other similar resources, for example, GCMD or MEL, that werementioned earlier. For more detailed discussion on the role of metadata in distributed data networkssee Nieto-Santisteban et al. [24].

The second function of the Web portal is data access. In Figure 1 the Web portal is shown as aclient, which connects to numerous data sources, retrieves the requested data and delivers it back tothe user. Advanced Web-portal functions might include visualization and data mining. Data access

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 4: Integrating and mining distributed environmental archives on Grids

2160

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

M. ZHIZHIN ET AL.

Data adminMetadata

Web

Service

Datasource

New

Datasource

Web

Portal

(client)

Register Resource

Add to List Data Resources

Robot

Update

Metadata from

Resource List

Workability

Check

User

Search Datasource

by Metadata

...

Query Management

Data Access

Figure 1. Web portal as a client for Grid of data sources.

Web forms are built using inventory (low-level) metadata describing the availability of the station’s—satellites—instruments for the given time interval. The inventory metadata can also be used to compareand synchronize mirrored data sources.

Space physics interactive data resource

Since 2001, the SPIDR has been a de facto standard data source for solar-terrestrial physics, functioningwithin the framework of the ICSU World Data Centers [13]. It is a distributed database and applicationserver network, built to select, visualize and model historical space weather data distributed across theInternet. One can use SPIDR as a fully functional Web application (portal) or as a Grid of Web services,providing functions for other applications to access its data holdings.

Currently SPIDR archives include solar activity, solar wind data, geomagnetic, ionospheric, cosmicray, radio-telescope ground observations, telemetry and images from NOAA, NASA and DMSPsatellites. SPIDR nodes are installed in the U.S., Russia, China, Japan, Australia, South Africa andIndia each holding about 1 TB of space weather data with a 20 000 member user community and stableload of about 100 sessions per day.

The SPIDR portal combines a central metadata repository with a set of distributed data Web services,Web map services and replicated sets of data files. A user can search the metadata inventory, use apersistent data basket to save information between sessions and plot or download the selected datain different formats, including XML and NetCDF. Database administrators can upload files into theSPIDR databases using either Web services or the Web-portal interface. SPIDR databases at othernodes pull these changes automatically so that nodes remain synchronized.

Clients can load data into SPIDR databases from files located on the user’s workstation, withloading options and data passed to the SPIDR Web services using the SOAP with attachments method.

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 5: Integrating and mining distributed environmental archives on Grids

2161

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

ENVIRONMENTAL ARCHIVES ON GRIDS

The database loading Web service called by the client will parse the input file format, load data into thelocal database, add a bookkeeping record into SPIDR’s data input/output logging database and checkthe list of mirrored SPIDR nodes, sending the input data file to mirror sites keeping their databases insync.

Weather reanalysis data resource

The NCEP/NCAR weather reanalysis data archive [5] was derived from the numerical weatherprediction Global Circulation Model (GCM). This model uses data ingest procedures to assimilateobservational data into model results to produce a consistent picture of the terrestrial weather ona regular time step and fixed spatial Grid. The spatial resolution for the NCEP/NCAR reanalysisis 2.5◦(latitude) × 2.5◦(longitude) × 100(parameters) ≈ 106 Grid points. The ‘high-frequency’ GCMoutputs the data for that Grid every six hours of simulation time, resulting in ∼400 MB of data persimulation day. By contrast, the worldwide daily meteorological observational data collected over theGlobal Telecommunications System, is ∼200 MB. The NCEP/NCAR re-analysis project [5] has runthe GCM for more than a 50 year simulation period, providing ∼8 TB of global weather data.

To accelerate the typical data requests for the ESSE search engine, we have developed a specialparallel database cluster and optimized the database schema. Each cluster node stores several years ofNCEP/NCAR reanalysis data. For example in a 10-node cluster, the first node stores data for years1950, 1960, . . . ; the second node stores years 1951, 1961, . . . , and so on. For each year we have aseparate database, and one table for each parameter such as temperature or pressure. The table has asimple structure: latitude, longitude, height (optional), and a blob of data with a floating-point timeseries for one year at one latitude–longitude–height location (Figure 2). The data records are indexedby location.

OGDSA-DAI environmental data services

The implementation of an environmental data access system that incorporates the ESSE searchengine had been based on the OGSA-DAI framework [14,25], the emerging standard for representingdatabases in Grids.

OGSA-DAI is a middleware product that supports the representation of various data resources,such as relational or XML databases, onto the Internet and Grids. The basic abstraction introduced inOGSA-DAI is the notion of a data resource that is able to perform data access and data transformationactivities. Typically, each database is represented as a separate data resource, but the concept is generalenough to represent heterogeneous databases also. Data resources may differ in a set of activities theyare able to perform. For example, a data resource representing a relational database may execute SQLqueries while XPath queries may be submitted to a data resource representing an XML database.The advantage of OGSA-DAI is that clients use standard Web or Grid service protocols to submitqueries and obtain results. In addition, data resources may be orchestrated in such a way that result setsfrom one data resource connect directly as input to another data resource.

For each data resource, OGSA-DAI exposes a Web service endpoint, to which Web service messagescan be addressed. The message with a perform operation contains an XML perform document as itsparameter and returns an XML response document as a result. The perform document describes one ormore activities. A data resource performs different kinds of queries, transformations and deliveries or

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 6: Integrating and mining distributed environmental archives on Grids

2162

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

M. ZHIZHIN ET AL.

Figure 2. Data stored on the NCEP/NCAR cluster.

manipulation operations depending on the activity types and parameters. Each activity may have inputand output channels that may be linked with each other within one perform document. Thus a OGSA-DAI perform document describes a simple workflow to be performed by the data resource. Specialtypes of delivery activities enable linking channels on remote OGSA-DAI servers. In a real systemthe portal application that yields the user interface creates the perform document, invokes it to remoteOGSA-DAI server and displays the result to a user.

The relational data model proposed in 1970 by Codd [26] and its implementation in the form ofSQL DBMSs with possible fuzzy logic extensions [27], so successful in business applications, is notuniversally adopted for environmental data archives. Petabyte data products [5,6] are still delivered inthe form of directory tree/file collections because the sequential file structure such as NetCDF [28]or HDF [29] is closer to the time-series data model than a set of related rows from two-dimensionaltables. The UNIDATA [30] THREDDS server with OpenDAP network data access protocol attemptsto aggregate different file formats under a single array-oriented Common Data Model. This ongoingunification effort currently does not support an XML format for data export and is not compatible withthe emerging e-Science Data Grid standards [14,24,25]. Thus we had to create a separate OGSA-DAIdata resource component and a set of corresponding activity components (Table I).

The data from the OGSA-DAI service may come to a user in different formats. We have implementedthe binary NetCDF format [28] commonly used in environmental sciences and a more general-purposeXML format. Unlike business applications, in the environmental sciences domain the XML is not yet

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 7: Integrating and mining distributed environmental archives on Grids

2163

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

ENVIRONMENTAL ARCHIVES ON GRIDS

Table I. ESSE components added to the OGSA-DAI.

Component Description

EsseDataResource Represents environmental databaseGetMetadataActivity Query activity. Returns the description of the data maintained by the

EsseDataResourceGetXmlDataActivity Query activity. Returns one or several time series from the

EsseDataResourceGetNetCdfActivity Query activity. Serializes a data subset into a NetCDF file and returns a URL

to that fileFuzzySearchActivity Transformation activity. Receives one or more time series from

GetXmlData and returns fuzzy membership function values

recognized as a standard for data transfer. This is due to generally observed larger file sizes and greaterprocessing times for XML data compared with data in binary format. File sizes are even more importantin distributed systems where large transfers may easily saturate the network.

Several standardization efforts exist to eliminate this problem. One of the approaches is to createan alternative encoding format that allows efficient interchange of the XML Information Set [31,32].This requires dramatic changes in all the XML infrastructure and still lacks interoperability andsupport. Another more practical approach is to combine existing standards. The resulting XML-binaryOptimized Packaging recommendation (XOP) [33] enables binary wire transfer of large-size base64encoded XML content.

In the present paper we advocate using standard compression algorithms to reduce the size of XMLencoded data. The OGSA-DAI framework already contains activities to compress and decompress datathat flows through a channel. The compressed data is included in the response document using base64encoding, which makes the transfer even more efficient when Web services endpoints support XOP.

To corroborate this opinion Table II compares the amount of data transferred from the server to theclient for the same query but with different output options. The first option was to serialize data in thebinary NetCDF file. The second option was to serialize data in an XML document. The third option wasto compress XML data from getXmlData activity using gzipEncode activity and serialize the outputwith base64 encoding. In all the cases we send to the client 50 years of data in the point near Moscowfor three parameters (pressure and components of wind speed) with 6 hours time step. You can see thatpayload of compressed XML data in this case is even less than the size of pure binary files.

MINING DISTRIBUTED ENVIRONMENTAL ARCHIVES

People often use qualitative notions to describe such variables as temperature, pressure and wind speed.In reality, it is difficult to put a single threshold between what is called ‘warm’ and ‘hot’. In this sectionwe describe implementation of the ESSE that allows for evaluating such qualitative queries on the Gridof distributed environmental archives. Fuzzy set theory serves as a translator from vague linguisticterms into strict mathematical objects.

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 8: Integrating and mining distributed environmental archives on Grids

2164

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

M. ZHIZHIN ET AL.

Table II. Data load for binary and XML data serialization.

Activity Description Load (KB)

getNetCdfData Binary NetCDF file 567getXmlData Response document containing data in XML format 2442getXmlData and Response document containing GZIP compressed 327gzipCompression and then base64 encoded XML data

Environmental scenarios

Fuzzy sets were introduced by Zadeh [37] in the 1960s as a superset of conventional (Boolean) logicthat has been extended to handle the concept of partial truth to model the uncertainty of naturallanguage. Currently we support fuzzy linguistic (large, small), numerical (less than, in the rangebetween) and causal (before, after) terms to query for events described as sequences of states ofenvironment at some locations such as Grid point or stations, or along a spatial trajectory. The spatial(near, far in space) and temporal (close, far apart in time) reasoning, as well as inverse path ofknowledge discovery to learn an event scenario from data, are in our plans.

The base data model in our study is a vector-valued time-series that can be seen as a trajectory in theM-dimensional phase space. For example, in Figure 3 we have a two-dimensional trajectory in the airpressure–temperature (P–T) space. Using the dualism between set theory and logic, we call the ‘state’S any sub-region of the phase space that can be described by a fuzzy logic expression on predicatesdescribing the parameter values in each dimension in numerical or linguistic terms [34]. In Figure 3the state S1 corresponding to the upper-right region can be described by the fuzzy expression:

S1 = (Very Large(P)) AND (Very Large(T))

where the linguistic term Very Large() is a predicate, and the operator AND stands for the fuzzylogic conjunction. In the same way, the state S2 corresponding to the lower-left region is

S2 = (Very Small(P)) AND (Very Small(T))

Now, combining the fuzzy set descriptions of the states with the ‘time shift’ operatorShift(dT, ) to describe transitions between the states, we can write a symbolic expression forthe environmental scenario ‘a day with very low temperature after a day with very high temperatureand pressure’:

Scenario = (Shift(dT=1 day,S1)) AND (S2).

The only pair of observations in Figure 3 that fit the above scenario is the pair (t1,t2).Our environmental scenario search engine, ESSE, is designed to mine for the phase space transitionslike that in the very large scientific databases.

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 9: Integrating and mining distributed environmental archives on Grids

2165

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

ENVIRONMENTAL ARCHIVES ON GRIDS

T min T max

P min

P max

tmin

tmax

t1

t2

t3

t4Very

Small

Very

Small Very

Large

Very

Large

T

P

Figure 3. Time-series as a trajectory in the 2D phase space (P, pressure, T, temperature).

Fuzzy logic predicates

A classical set A in a space of objects U can be defined by its indicator function IA(u) : U → {0, 1},which is equal to 1 for all elements u from the set A and to 0 otherwise. A fuzzy set expresses degreeto which an element belongs to a set. Hence the indicator function of the fuzzy set is allowed to havevalues between 0 and 1, which denotes degree of membership of an element in a given set. A fuzzy setA in U is defined by its membership function (MF) μA(u) : U → [0, 1], which maps each element ofU to its membership grade between 0 and 1.

The ESSE conditions are built from nine standard membership functions that are mapped to thefollowing linguistic terms: Very Small, Small, Average, Large, Very Large, About a value, Less Than avalue, Greater Than a value and Between two values. The default ESSE library of MFs uses the genericbell ‘mother’ function [34]:

μg bell(x; a, b, c) =[

1 +∣∣∣∣ x − c

a

∣∣∣∣2b]−1

Here, x stands for normalized for range [0,1] scalar data variable, c stands for center of the symmetrical‘bell’, a for its half-width and b/2a controls its slope. We use here simple range normalization for thevariable x:

x = x − xmin

xmax − xmin

where xmin and xmax stand for the minimal and maximal observable values of x, respectively.

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 10: Integrating and mining distributed environmental archives on Grids

2166

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

M. ZHIZHIN ET AL.

00.20.40.60.8

1

0 0.2 0.4 0.6 0.8 1

Very small Small Average

Large Very Large

Figure 4. Membership functions of the ESSE ‘linguistic terms’.

Five MFs for a linguistic term set {‘very small’, ‘small’, ‘average’, ‘large’, ‘very large’} are plottedin Figure 4.

Basic operations of classical set theory (union, intersection, complement) and correspondingoperations of mathematical logic (or, and, not) can be generalized for fuzzy sets and fuzzy logicin many different ways. The fuzzy generalization of intersection (logical ‘and’) is usually called aT-norm operator, generalization of union (logical ‘or’) is called a T-conorm or S-norm operator and thegeneralization of logical ‘not’ is called a fuzzy complement operator.

One of the simplest generalizations from classical to fuzzy set theory for two MFs μA(u), μB(u) isto use minimum of MFs for the intersection of fuzzy sets (fuzzy logic ‘and’)

μA∩B = min(μA, μB)

maximum of MFs for fuzzy sets union

μA∪B = max(μA, μB)

and one complement for fuzzy set complement

μA = 1 − μA

In 1980 Yager [35] introduced a parametric family of T-norms, T-conorms and fuzzycomplements [36]. Parameterized by q ≥ 1 a family of fuzzy ‘and’ aggregations for two MFs is definedby Yager’s T-norm operator:

TY(μA(x), μB(x), q) = 1 − min{1, [(1 − μA(x))q + (1 − μB(x))q ]1/q}More general formula for the parametric Yager’s T-norm operator for fuzzy ‘and’ aggregation of anyM > 1 MFs μm(x), m = 1 . . . M is

TY(μm(x), q) = 1 − min

{1,

[ M∑m=1

(1 − μm(x))q]1/q}

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 11: Integrating and mining distributed environmental archives on Grids

2167

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

ENVIRONMENTAL ARCHIVES ON GRIDS

The resulting surface of values for the multi-dimensional MF is more smooth than using a simpleminimum of the aggregating MFs, which is the limit case of Yager’s T-norm for q = 1. Parametricfamily of fuzzy ‘or’ aggregations for two MFs is described by Yager’s T-conorm operator

SY(μA(x), μB(x), q) = min{1, [(μA(x))q + (μB(x))q]1/q}More general formula for Yager’s ‘or’ aggregation of any M > 1 MFs is

SY(μm(x), q) = min

{1,

[ M∑m=1

(μm(x))q]1/q}

Yager’s fuzzy complements are defined by formula

NY(μ(x), q) = (1 − (μ(x))q)1/q

The ESSE is designed to support different libraries of T-norm, T-conorm and complement operators.The results below are obtained using the Yager’s formulas with the order q = 5.

Environmental scenario search engine

We have added a data mining activity fuzzySearchActivity to the OGSA-DAI service asan implementation of the ESSE [38]. The activity element in the service ‘perform document’ is acombination of fuzzy conditions on environmental parameters’ values localized in space and time; infact it is an XML-formatted environmental scenario description. The activity output is a time serieswith values between 0 and 1 for the fuzzy likeliness of the occurrence of the scenario at every momentin time.

The fuzzySearchActivity mining activity is the data transformation activity which is notlinked to a specific type of data resource. This makes the whole data mining system extremely flexible.One can search the environmental scenario over several parameters stored in a local database. This isaccomplished by combining several query activities with the fuzzySearchActivity activity in asingle workflow within one perform document. In a more advanced scenario it is possible to combinea data search from several OGSA-DAI resources.

The highest scores in the fuzzySearchActivity output can be used as indications of singleoccurrences of the searching event. For example, in the searching for a scenario, which was describedin previous section, at the Grid point nearest to Moscow for the months of June in years 1950–2005gives the event of 13–14 June 2002 with the air temperature and pressure plotted in Figure 5. In thatcase meteorological satellite images can be used for independent verification of the weather conditionsin the event, found by the search engine. The DMSP [41] satellite F13 images for the event 13–14 June2002 are presented in Figure 6.

Output scores that are above a given threshold can be used as a filter for another time series.For example, search results for the Low(Cloud Cover) condition can be used as a filter over asequence of the satellites images.

CONCLUSIONS AND FUTURE WORK

As more and more data archives become available through projects such as the Earth System Grid [12]of the DOD, the Comprehensive Large Array-data Stewardship System [39] of NOAA, the Earth

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 12: Integrating and mining distributed environmental archives on Grids

2168

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

M. ZHIZHIN ET AL.

1 2

Figure 5. Air temperature (top) and pressure (bottom) in Moscow for the event of 13–14 June 2002.

Observing System Data and Information System [40] of NASA and other network accessible datasystems, the tools to extract information from them become more important. The ESSE can help userssift through the vast quantities of data available online and point at the interesting bits. This means thateven with the volume of data increasing so rapidly and the number of researchers remaining relativelylevel we can hope to extract the most valuable information from the observations and carry that backto the relevant scientific communities.

The application of fuzzy-logic-based data tools goes far beyond simple event selection. For example,an ever present issue when dealing with these large data sets is quality control. There is simply too largea volume to reasonably screen by hand. Using techniques such as peer-matching and expert systemswe can extend the ESSE to monitor data and alert data managers to changes and anomalies. As thecomputational power available expands we can extend the system into areas such as data classification

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 13: Integrating and mining distributed environmental archives on Grids

2169

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

ENVIRONMENTAL ARCHIVES ON GRIDS

Figure 6. Satellite images in the infrared (left) and visible (right) optical bands for theevent of 13–14 June 2002. (From [7].)

whereby we can identify modes of the environment and perhaps identify new unknown relations inspecific regions.

Finally, the emergence of a network infrastructure for data access is providing new opportunitiesfor the scientific researcher. It is now fairly trivial to reach out across discipline boundaries and accessdata in an immediately useable format. For example, this is true in the case of the terrestrial weathercommunity being able to make use of the space data made available through SPIDR. With theseopportunities come challenges. As researchers expand into domains in which they may not be expertthey will come to rely on intelligent tools to support them.

REFERENCES

1. Hilty LM, Page B, Radermacher FJ, Riekert WF. Environmental informatics as a new discipline of appliedcomputer science. Environmental Informatics—Methodology and Applications of Environmental Information Processing,Avouris NM, Page B (eds.). Kluwer: Dordrecht, 1995; 1–11.

2. SzalayA, Gray J. 2020 computing: Science in an exponential world. Nature 2006; 440(7083):413–414.3. Earth Science Data and Services Directory: Global Change Master Directory Web site.

http://gcmd.nasa.gov [19 October 2006].4. Master Environmental Library homepage. https://mel.dmso.mil [19 October 2006].5. Kalnay E et al. The NCEP/NCAR 40-year reanalysis project. Bulletin of the American Meteorological Society 1996;

7(3):437–471.6. Uppala S et al. The ERA-40 re-analysis. Quarterly Journal of the Royal Meteorogical Society 2005; 131:2961–3012.7. Space Physics Interactive Data Resource. http://spidr.ngdc.noaa.gov [19 October 2006].8. The DataGrid project. http://eu-dataGrid.web.cern.ch/eu-dataGrid [19 October 2006].9. SDSS SkyServer DR4. http://cas.sdss.org/dr4/en [19 October 2006].

Copyright c© 2007 John Wiley & Sons, Ltd.

Page 14: Integrating and mining distributed environmental archives on Grids

2170

Concurrency Computat.: Pract. Exper. 2007; 19:2157–2170DOI: 10.1002/cpe

M. ZHIZHIN ET AL.

10. GridFTP protocol. http://www.globus.org/Grid software/data/Gridftp.php [22 December 2006].11. SRB: Storage Resource Broker. http://www.sdsc.edu/srb/index.php/Main Page [22 December 2006].12. Earth System Grid. https://www.earthsystemGrid.org [19 October 2006].13. World Data Center System. http://www.ngdc.noaa.gov/wdc [19 October 2006].14. Karasavvas K, Antonioletti M, Atkinson MP, Hong NPC, Sugden T, Hume AC, Jackson M, Krause A, Palansuriya C.

Introduction to OGSA-DAI services. Proceedings of the 1st International Workshop on Scientific Applications of GridComputing, Beijing, China, 20–24 September 2004 (Lecture Notes in Computer Science, vol. 3458), Herrero P, Perez MS,Robles V (eds.). Springer: Berlin, 2004; 1–12.

15. Antonioletti M et al. The design and implementation of Grid database services in OGSA-DAI. Concurrency andComputation: Practice and Experience 2005; 17(2–4):357–376.

16. XSL Transformations (XSLT). W3C Recommendation REC-xslt-19991116, 1999. Available at: http://www.w3.org/TR/xslt.17. Deutsch P. GZIP file format specification version 4.3. Request for Comments 1952, IETF, 1996. Available at:

http://www.ietf.org/rfc/rfc1952.txt.18. ASP.NET developer center. http://msdn.microsoft.com/asp.net [19 October 2006].19. WebServices—Axis. http://ws.apache.org/axis [19 October 2006].20. Globus Toolkit homepage. http://www.globus.org/toolkit [19 October 2006].21. OMII: Open Middleware Infrastructure Institute U.K. http://www.omii.ac.uk [19 October 2006].22. gLite: Lightweight middleware for Grid computing. http://glite.web.cern.ch/glite [19 October 2006].23. Easton J. Geographically dispersed Grid, part 1: Aligning computation and data. IBM DeveloperWorks on Grid Computing,

2004. Available at: http://www-128.ibm.com/developerworks/Grid/library/gr-gdg1/index.html.24. Nieto-Santisteban MA, Gray J, Szalay AS, Annis J, Thakar AR, O’Mullane W. When database systems meet the Grid.

Proceedings of the 2nd Biannial Conference on Innovative Data Systems Research, Asilomar, CA, 4–7 January 2005;154–161. Available at: http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&id=786.

25. Paton N, Atkinson M, Dialani V, Pearson D, Storey T, Watson P. Databases access and integration services on the Grid.Technical Report UKeS-2002-03, U.K. e-Science Programme, 2002. Available at:http://www.nesc.ac.uk/technical papers/dbtf.pdf.

26. Codd EF. A relational model of data for large shared data banks. Communications of the ACM 1970; 13:377–387.27. Petry F. Fuzzy Databases, Principles and Applications. Kluwer: Dordrecht, 1996.28. NetCDF (network common data form). http://www.unidata.ucar.edu/software/netcdf [19 October 2006].29. The HDF Group home page. http://www.hdfgroup.org [19 October 2006].30. Unidata Program Center. http://www.unidata.ucar.edu [19 October 2006].31. Fast Infoset. ITU-T RECOMMENDATION X.891, 2005.32. Efficient XML Interchange Working Group. http://www.w3.org/XML/EXI/ [19 October 2006].33. XML-binary optimized packaging. W3C recommendation REC-xop10-20050125, 2005. Available at:

http://www.w3.org/TR/xop10/.34. Jang JSR, Sun CT, Mizutani E. Neuro-Fuzzy and Soft Computing. Prentice-Hall: Englewood Cliffs, NJ, 1997.35. Yager R. On a general class of fuzzy connectives. Fuzzy Sets and Systems 1980; 4:235–242.36. Yager R. On the measure of fuzziness and negation, part I: Membership in the unit interval. International Journal of

Man–Machine Studies 1979; 5:221–229.37. Zadeh L. Fuzzy sets. Information and Control 1965; 8:338–353.38. Zhizhin M, Poyda A, Mishin D, Medvedev D, Kihn E, Lyutsarev V. Scenario search on the Grid of environmental data

sources. Technical Report MSR-TR-2006-72, Microsoft Research, 2006. Available at:http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&id=1116.

39. NOAA’s CLASS—Comprehensive Large Array-data Stewardship System. http://www.class.noaa.gov [19 October 2006].40. NASA Earth System Science Data and Services. http://nasadaacs.eos.nasa.gov [19 October 2006].41. DMSP: Defence Meteorological Satellite Program. http://www.ngdc.noaa.gov/dmsp [22 December 2006].

Copyright c© 2007 John Wiley & Sons, Ltd.