Computers & Geosciences 34 (2008) 387–398

www.elsevier.com/locate/cageo

Automating geospatial metadata generation—An integrated data management and documentation approach

James K. Batcheller*

Edinburgh Earth Observatory, Institute of Geography, School of GeoSciences, University of Edinburgh, Drummond Street, Edinburgh EH8 9XP, UK

Received 15 January 2007; received in revised form 30 March 2007; accepted 4 April 2007

Abstract

Geospatial metadata have long played an important role in the management of geospatial datasets. Often employed by institutions to organise, maintain and document their geographic resources internally, metadata may also provide a vehicle for exposing marketable data assets externally when contributed to on-line geospatial exchange initiatives. In spite of the numerous benefits it affords, obstacles to the production of such geospatial surrogates are numerous. The current work proposes an approach aimed at reducing the effort associated with geospatial metadata generation through the customisation of a proprietary Geographical Information System (GIS). By coupling data preparation, management and documentation approaches with such a bespoke application, it is intended to mitigate impediments to geospatial metadata generation whilst promoting a system of data administration that safeguards the data it supports. The current prototype, implementing an extended Dublin Core geospatial profile of 23 elements, was capable of generating a total of 20 basic metadata entries. While the findings do not suggest a dispensability of human mediation in the authoring process, they do support the view that a dataset’s ambient computing infrastructure has the potential to play a significant role in automating the creation of geospatial metadata.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Geospatial metadata; Metadata authoring; Metadata generation; Data management; Data documentation

*Tel.: +44 131 6502558; fax: +44 131 6502524. E-mail address: [email protected]

0098-3004/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.cageo.2007.04.001

1. Introduction

Since their appearance in the latter half of the twentieth century, the proliferation of Geographical Information Systems (GIS), their applications and related technologies has continued apace (Goodchild and Haining, 2004). With the more current developments in the realm of web-enabled geospatial services, as well as the emergence of popular Geographical Exploration Systems (GES) such as Google Earth and Microsoft’s Virtual Earth, increasing numbers of people continue to be introduced to the possibilities afforded by such technologies. Whether for public, private or academic purposes, demand for geographical information (GI) has in addition increased several-fold in recent times, with those looking to procure data turning to the exploration of existing data pools, commissioning the collection of new data, or resorting to producing their own. Efforts to meet this demand contribute to the explosion of data currently available, introducing further problems relating to the management of quite often voluminous data holdings, and how such assets can be successfully exploited.¹

The more data and information are produced, the more vital approaches for managing and locating such resources become (Gobel and Lutze, 1998); the role geospatial metadata assumes here has been widely acknowledged² (Kim, 1999; Tsou, 2002; Limbach et al., 2004). Apart from providing a means of documenting a dataset’s key statistics such as its quality, appropriateness, currency or area of coverage, metadata can supply information on the availability of the data it describes and how it may be accessed and exchanged; it contributes towards data management efforts by helping to organise, maintain and locate data resources; when collated into catalogues, metadata collections can be indexed for rapid query, or contributed to data clearinghouses or similar data exchange initiatives where they can be used to externally expose marketable data assets; it aids in the coordination of data procurement efforts by raising the awareness of extant datasets, thereby avoiding duplication of effort, redundant storage and obscured search results.

Further incentives for its use arise when the implications of neglecting metadata entirely are considered. Some claim that the cost of not creating metadata can outweigh that of authoring it, citing concerns associated with employee turnover, data redundancy, conflicts and inappropriate decision-making.³ Others meanwhile go so far as to argue that data is rendered useless in the absence of any metadata (Qi et al., 2004). Despite the arguments, obstacles to the adoption of metadata practices remain. Many view its generation as monotonous and time consuming, a labour-intensive process which is a major undertaking in itself (Guptill, 1999; West Jr. and Hess, 2002), resulting in a pervasive outlook which shuns metadata creation (Mathys, 2004). Streamlining conventional authoring processes, and thereby conserving associated resources, would mitigate the barriers to data documentation.

¹ Yuan, M., Buttenfield, B., Gahegan, M.N., and Miller, H., 2001. Geospatial Data Mining and Knowledge Discovery. http://ags.ou.edu/~myuan/papers/mining.pdf.

² Kacmar, C., Jue, D., Stage, D., and Koontz, C., 1995. Automatic Creation and Maintenance of an Organizational Spatial Metadata and Document Digital Library. http://csdl.tamu.edu/DL95/papers/kacmar/kacmar.html.

³ Deng, Y., 2002. The Metadata Architecture for Data Management in Web-based Choropleth Maps. http://www.cs.umd.edu/projects/hcil/census/JavaProto/metadata.pdf.

The negative perceptions of metadata practices can persist even once they have been adopted, often with harmful consequences for the quality of output. Even where its value is recognised, data documentation commonly takes low priority in relation to other activities, reduced to being seen as “a necessary evil.”⁴ And as conventional geospatial dataset documentation remains a largely manual process, it tends not only to be tedious when finally undertaken, but also error prone (Leiden et al., 2001). Considering that large volumes of data currently on offer emanate from those not traditionally considered to be geospatial data producers,⁵ questions arise as to whether the accompanying metadata (when present) consistently reflects that which it purports to document.

The current work proposes an approach aimed at reducing the effort associated with geospatial metadata generation. Further, by combining data preparation, filing and documentation workflows within a single framework, barriers to the creation of geospatial metadata can potentially be lowered while simultaneously enforcing a system of data organisation designed to safeguard such assets. Regardless of application domain, it is contended that facilitating the accelerated location, retrieval and interpretation of an organisation’s data holdings through the use of metadata can serve to realise the potential of (frequently underexploited) geospatial resources. The paper first reviews previous work and then details the proposed framework; the findings are discussed thereafter, followed by the conclusions drawn.

2. Related work

2.1. Digital library and information science community

Given the rapid and continual growth of accessible digital resources observed since the advent of the World Wide Web, it is unsurprising that efforts to facilitate effective information location, navigation and retrieval through resource documentation have followed. The digital library and Internet cataloguing arenas have hosted a number of research initiatives investigating automated metadata generation, motivated by the view that it is “unrealistic to depend on traditional humanly generated metadata approaches” when considering the volumes of resources involved (Greenberg et al., 2006).

⁴ Vermeij, B., 2001. Implementing European Metadata Using ArcCatalog—ArcUser Online. http://www.esri.com/news/arcuser/0701/metadata.html.

⁵ Schweitzer, P.N., 1998. GIS and Metadata—Putting Metadata in Plain Language. www.geoplace.com.

⁶ The Federal Geographic Data Committee.

Greenberg (2003) elaborates a framework for metadata generation for online content, noting the part standards play in guiding metadata authoring in addition to the roles of human and computing resources. Automated practices therein are categorised into those which employ resource content indexing, i.e. are not predicated on the presence of recognised metadata elements, and those employed by commercial search engines, whether using pre-formed metadata or that produced at run-time. Liddy et al. (2001) suggest that such automated techniques can produce reasonable results in certain circumstances; Anderson and Perez-Carballo (2001) maintain that automated methods tend to be more efficient, consistent and inexpensive than human ones. Whatever the proposed method, most agree that automated and manual approaches combined promise the most in producing quality resource documentation (Craven, 2001; Greenberg, 2004).

2.2. The geospatial domain

2.2.1. Geospatial data

The very nature of geospatial data dictates a somewhat different approach than those mentioned above. GI data tend to be both highly structured and manifested in a variety of forms, characterised by the presence of some treatment for geometry. Storage techniques vary, even within proprietary systems, from hybrid models that store spatial and attribute information separately across different files to integrated strategies employing relational databases (Batcheller et al., 2007). Most geospatial storage formats therefore do not lend themselves to the same probing operations as used with textual resources given their content’s relative lack of accessibility.

The lack of sophisticated support for metadata within pioneering geospatial storage strategies meant that any important information not encoded within a dataset needed to be documented elsewhere. Externalising metadata in discrete text files not only bypassed the need for opening often large documents in their host applications when certain dataset attributes were sought; it would also enable the use of existing indexing and cataloguing techniques for data location and management previously mentioned. Authoring tools consisted of common text editors with metadata often recorded in ad hoc or institution-specific conventions with few common guidelines and little provision for interoperability.

2.2.2. Geospatial metadata standards

More recently, geospatial data documentation efforts have been underpinned by the use of standards, viewed as important keys in facilitating metadata exchange, interpretation by individuals and manipulation by machines. Rising from the initial foundations laid by early data standard initiatives (Moellering, 1992), geospatial metadata standards, like their mainstream counterparts, aim to define metadata content and structure. Content standards describe a “common set of terminology and definitions for the documentation of digital (geospatial) data” (FGDC, 1998), while metadata encoding standards, commonly implemented using XML Document Type Definitions (DTD) and XML Schemas, outline how content is manifested digitally. Used in tandem, these standards in many ways simplify metadata generation, offering a template for content as well as providing guidelines for permitted input.

Considering the detail to which common geospatial metadata conventions are elaborated however, standards can simultaneously have a detrimental effect, complicating metadata generation and potentially undermining implementation initiatives (Tsou, 2002). In the United States, the FGDC’s⁶ Content Standard for Digital Geospatial Metadata (CSDGM) of 1998 outlines a standard containing over 300 data and compound elements (FGDC, 1998). The recently ratified ISO 19115:2005 standard for Geographic Information meanwhile details a metadata element set of over 400 (ISO, 2005). Clearly the benefits afforded by full compliance with either standard will be significantly outweighed by the resources necessary to achieve it.

Metadata standard profiles have consequently arisen for a variety of application domains, the creation of which may themselves be guided by formalised standards such as ISO 19106:2006 Geographic Information—Profiles (ISO, 2006). Essentially subsets of a given metadata convention, profiles define a limited set of elements designed for a specialised purpose whilst simultaneously maintaining standard compliance, often simplifying the metadata authoring in the process. Profiles have for instance been developed to enable data location (discovery metadata), to help potential users make decisions on a dataset’s appropriateness (exploration metadata) and to facilitate data utilisation (exploitation metadata) (Taylor, 2004). Similarly, region-specific profiles abound, such as the Eurocentric ISO profiles overseen by CEN⁷ (Longhorn, 2005). While evidently useful for reducing the effort of documenting data, profiles on their own make only a minor contribution in improving the issues relating to metadata authoring efficiency.

⁹ For example the XML-based Geography Mark-up Language, or GML.

2.2.3. Geospatial metadata creation tools

The adoption of standardised approaches to metadata creation, not least the growing popularity of XML, has made the development of generic tools to aid such practices worthwhile. And with the advent of national and international data exchange initiatives,⁸ increasing focus has been lent towards the development of tools that encourage the production of consistent, conformant metadata.

Early stand-alone desktop tools provided the metadata author with an interface for entry, with standards-based output directed to either file or relational database storage schemas (cf. askGIraffe’s customised Microsoft Access tool (Foy, 2001)). Later versions incorporated on-edit or on-export error trapping, using domain lists, DTD or XML schema validation, as well as providing support for metadata parsing, importation and (rudimentary) conversion. Crafted predominantly in the Java or Visual Basic development environments, such tools are characterised by their independence from the proprietary applications commonly used to create and edit the data to be documented. As such, these metadata editors can also function as viewers, permitting the browsing of key (recorded) dataset themes without the need of a GIS suite (West Jr. and Hess, 2002). Examples of such editors include GIS-tec’s Metadata InGeo EntryTool (Limbach et al., 2004) and GIgateway’s MetaGenie Desktop (Batcheller and Gittings, 2006).

Whilst desktop editors may serve to produce metadata for use both within an organisation and beyond, online editors are predominantly used as components of geospatial data clearinghouses and GIS portals as mechanisms for metadata contribution. Used to streamline the submission process, editors such as EDINA’s Go-Geo! Metadata Creator (Mathys, 2004) and G-portal’s XML Metadata Resource Editor (Lim et al., 2005) also serve to help reduce metadata redundancy and replication as records are typically edited where they are hosted. In addition, both strategies are often enhanced through the use of context-sensitive help, tool-tips and option lists designed to guide user input and improve metadata quality.

⁷ The European Committee for Standardisation/Comité Européen de Normalisation.

⁸ Notable examples include the FGDC’s National Geospatial Data Clearinghouse in the US and GIgateway in the UK.

An important trade-off of employing such independent, cross-platform editors is the disconnect of metadata authoring practices from proprietary application workflows. Dataset creation and editing are consequently detached from metadata creation and editing procedures, necessitating diligent update practices involving at minimum two separate applications. Providing an integrated workflow through which both data and metadata can be maintained would counter this disconnect, thereby minimising the risk of inconsistency.

2.2.4. Geospatial metadata applications

The advantages of leveraging GIS applications to aid data documentation go further than workflow consolidation. Geospatial suites provide largely unhindered access to data stores they support, an important consideration if streamlining metadata authoring through automation is to be achieved. And while more accessible open data formats are not dismissed,⁹ the majority of data in production environments are held in proprietary data stores, if published market share figures are to be believed.¹⁰ Furthermore, considering the inclusion of programming kits within most common GIS applications (in addition to their near-uniform support for disparate data formats), development of bespoke tools is greatly facilitated.

Due to the lack of metadata support amongst the forerunning data stores, early GIS-native tools focussed on metadata extraction or derivation techniques, i.e. where dataset attributes are mined and transformed for use as metadata items.

¹⁰ Market research firm IDC estimated the market share figure held by open-source desktop software in 2002 at approximately 3% (GIS Monitor, 12 June 2003, http://www.gismonitor.com/news/newsletter/archive/061203.php).


Approaches commonly involved executing an application script to extract and export dataset information (e.g. projection details and bounding coordinates) to file for subsequent validation. Nevertheless, external text editors were still required for the manual completion of each metadata record produced. A typical example is the FGDCMETA.AML tool written in Arc Macro Language (AML) for ESRI’s ArcInfo and designed for use with the FGDC’s CSDGM standard.
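The extract-and-export pattern described above can be sketched in a few lines of Python. This is an illustration only and does not reproduce FGDCMETA.AML or any ESRI routine: the function names, the `key: value` output layout and the sample coordinates are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class BoundingBox(NamedTuple):
    north: float
    east: float
    south: float
    west: float


def bounding_box(coords: Iterable[tuple[float, float]]) -> BoundingBox:
    """Return the extent of a non-empty sequence of (x, y) pairs."""
    xs, ys = zip(*coords)
    return BoundingBox(north=max(ys), east=max(xs), south=min(ys), west=min(xs))


def export_properties(name: str, projection: str, box: BoundingBox) -> str:
    """Format extracted properties as plain 'key: value' lines (hypothetical layout)."""
    lines = [
        f"dataset: {name}",
        f"projection: {projection}",
        f"northlimit: {box.north}",
        f"eastlimit: {box.east}",
        f"southlimit: {box.south}",
        f"westlimit: {box.west}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up vertex coordinates standing in for a dataset's geometry.
    box = bounding_box([(0.0, 0.0), (2.0, 5.0), (-1.0, 3.0)])
    print(export_properties("roads", "British National Grid", box))
```

The exported text would still need manual completion in an editor, which is the limitation of such early tools that the passage notes.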

As the perceived importance of metadata increased, GIS vendors started to introduce enhanced support for metadata both within their software offerings and alongside their data models. Native support for metadata content and schema standards, often manifested as XML, became a standard feature within many software offerings, as did the ability to edit and author metadata in-package. Many dataset properties were now treated as specific metadata items, and as a consequence could now also be harvested directly where before they were derived. In addition, with the near-universal adoption of XML-based technologies and increasing reliance of vendor-specific programming environments on common development platforms such as Sun Microsystems’ J2EE and Microsoft’s .NET, sophisticated turnkey extensions to existing GIS software now became a real possibility.

3. The present work

The current paper proposes that the efficiency of geospatial metadata generation can be significantly enhanced through proprietary software customisation and considered data preparation. Apart from recognising its position as market leader in the GIS software field (and thereby offering a familiar platform in which to present the current approach), ESRI’s ArcGIS 9.1 was chosen due to its extensive, extensible architecture based on modular programming components (ArcObjects) with which software can be rapidly developed. In addition, its ArcCatalog component “provides a framework for the implementation of a custom metadata environment,”¹¹ and thus presents an existing toolkit with which to build. The development platform employed was Microsoft’s .NET, chosen to exploit the platform’s support for solution extensibility and, in particular, its tight integration with XML (Stephens and Hochgurtel, 2002).

¹¹ Vermeij, B., 2001. Implementing European Metadata Using ArcCatalog—ArcUser Online. http://www.esri.com/news/arcuser/0701/metadata.html.

A prototype was built in Visual Basic .NET and compiled into a dynamic link library (DLL) file which is registered with the ArcGIS application. Although ArcCatalog provides near-uniform access to a range of data storage techniques, a single model was employed to limit the degrees of freedom of the current analysis. The personal geodatabase, an integrated single-user solution based on Microsoft’s Access RDBMS technology, was accordingly selected due to the relative ease with which it is configured as well as its positioning between (legacy) hybrid single-user file-based data stores and integrated multi-user database strategies.

The tool is designed to provide an integrated approach to metadata generation, based on a systematic data management structure and facilitating efficient data documentation, metadata validation and basic translation. Being native to ArcCatalog it is bound with the dataset initialisation, configuration and management workflow; it may however be readily retooled for use in ArcMap for applications where binding metadata creation with the data editing and analysis workflow is preferred.

On nomination of a dataset within ArcCatalog the tool is initiated via a standard button interface. The user is presented with a metadata editing form which functions as the principal interface of the tool. Pre-formed metadata items held alongside the dataset are instantaneously harvested on form load; elements may either be overwritten manually, completed if absent or selected for revision using the tool’s metadata routines. Routines may be run collectively or individually, allowing full user control over the tool’s operations. The operations the tool supports (illustrated in Fig. 1) include:

• Harvesting pre-existing metadata elements generated by ArcCatalog.

• Extracting file hierarchy, data and dataset properties and attributes for use as metadata elements.

• Harvesting user-prepared metadata templates.

• Guiding the visual inspection, modification and completion of metadata records through the structured presentation of record fields on an editing form.

• Enabling the importation from and exportation to other standards through the use of a basic metadata crosswalk.
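The way routines might be run collectively or individually while leaving existing entries untouched can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Visual Basic .NET prototype: the routine names, the record layout and the rule that a routine only fills fields that are absent or empty are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional

Record = Dict[str, Optional[str]]
Routine = Callable[[], Record]


def run_routines(
    record: Record,
    routines: Dict[str, Routine],
    selected: Optional[Iterable[str]] = None,
) -> Record:
    """Run every routine, or only those named in `selected`, filling empty fields."""
    wanted = None if selected is None else set(selected)
    result = dict(record)  # leave the caller's record unchanged
    for name, routine in routines.items():
        if wanted is not None and name not in wanted:
            continue
        for field, value in routine().items():
            if not result.get(field):  # complete only if absent or empty
                result[field] = value
    return result


if __name__ == "__main__":
    # Hypothetical routines returning made-up values.
    routines = {
        "harvest": lambda: {"Title": "roads_fc", "Language": "en"},
        "extract": lambda: {"Date Period": "monthly"},
    }
    manual = {"Title": "Road centrelines", "Language": None}
    print(run_routines(manual, routines))               # all routines
    print(run_routines(manual, routines, ["extract"]))  # one routine only
```

Giving manual entries precedence is a design choice made for this sketch; the prototype also allows elements to be selected for revision by the routines.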


Fig. 1. Flow diagram of current metadata prototype, drawing elements from the various sources. H: harvesting; E: extraction; X: export; U: update.


3.1. Metadata standard

A Qualified Dublin Core profile with geospatial refinements was defined to provide a succinct element set with which to test the metadata prototype (Table 1). Offering a widely adopted convention used for depicting any category of resource, Dublin Core was chosen as it affords a sufficiently sparse and manageable solution within the current context. And as a profile of a well-defined ISO standard, the set’s collection of elements is readily mapped to other more detailed geospatial conventions.
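As a quick check on the shape of the profile, the element set of Table 1 can be written down as a small data structure and counted. The sketch below is in Python; the dotted field names such as `Title.Alternative` are a naming convention assumed here for illustration, and `None` stands for an unrefined core element.

```python
# Core element -> refinements, transcribed from Table 1.
# None marks a core element used without a refinement.
PROFILE = {
    "Title": [None, "Alternative"],
    "Description": ["Abstract"],
    "Language": [None],
    "Subject": ["Keywords"],
    "Date": ["Created", "Modified", "Period.name"],
    "Creator": [None],
    "Publisher": [None],
    "Contributor": [None],
    "Format": [None],
    "Type": ["Dataset"],
    "Rights": ["Access Rights"],
    "Coverage": [
        "Spatial.Box.name",
        "Spatial.Box.projection",
        "Spatial.Box.northlimit",
        "Spatial.Box.eastlimit",
        "Spatial.Box.southlimit",
        "Spatial.Box.westlimit",
    ],
    "Identifier": [None],
    "Relation": [None],
    "Source": [None],
}


def field_names(profile):
    """Flatten the profile into one name per metadata field."""
    names = []
    for core, refinements in profile.items():
        for refinement in refinements:
            names.append(core if refinement is None else f"{core}.{refinement}")
    return names


if __name__ == "__main__":
    print(len(PROFILE))               # 15 core elements
    print(len(field_names(PROFILE)))  # 23 fields
```

The two counts agree with the note to Table 1: fifteen core elements yielding 23 fields.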

3.2. Initial harvesting

Initial harvesting takes advantage of the inherent metadata already collected from registered datasets by ArcCatalog. This metadata is stored in ISO-compliant¹² XML alongside the data, and prototype routines harvest the required elements contained therein using XPath expressions defined in a custom metadata crosswalk file. Primarily used for cross-mapping conventions as later described, the file details the addresses (in XPath) of elements contained within the dataset’s default XML metadata which are retrieved, offering an initial set of fields on which to build. In the current context, seven out of the 23 elements are automatically generated: Title, Language, Date Created, Format, Type, Coverage projection and bounding coordinates, as well as Identifier.

¹² Specifically, ISO 19115 metadata. Storage in FGDC CSDGM format is also supported. ArcGIS support for ISO 19139 Geographic information—Metadata—XML schema implementation started with version 9.2.
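Crosswalk-driven harvesting of this kind can be sketched with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The crosswalk entries, the element names and the sample document below are invented for illustration and do not reproduce the XML that ArcCatalog writes or the prototype's crosswalk file.

```python
import xml.etree.ElementTree as ET

# Hypothetical crosswalk: profile field -> path into the dataset's default XML.
# Real element names depend on the metadata format written by the GIS.
CROSSWALK = {
    "Title": "identification/title",
    "Language": "identification/language",
    "Coverage.Spatial.Box.northlimit": "extent/north",
    "Identifier": "distribution/linkage",
}

# Made-up stand-in for the XML held alongside a dataset.
SAMPLE_XML = """
<metadata>
  <identification>
    <title>Road centrelines</title>
    <language>en</language>
  </identification>
  <extent>
    <north>56.0</north>
  </extent>
</metadata>
"""


def harvest(xml_text, crosswalk):
    """Return the fields whose crosswalk path resolves to non-empty text."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, path in crosswalk.items():
        value = root.findtext(path)  # None when the element is missing
        if value is not None and value.strip():
            record[field] = value.strip()
    return record


if __name__ == "__main__":
    print(harvest(SAMPLE_XML, CROSSWALK))
```

Fields with no match (here `Identifier`) are simply left out of the harvested record, so they remain available for manual completion or for a later routine.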

3.3. Extraction routines

Custom routines are used to extract further information from the dataset, its data content as well as the dataset’s location within a refined folder hierarchy. The latter is based on the premise that efficient data management practices employ logically organised data stores. Here, metadata entities are used to organise the very data they describe, providing a nomenclature with which datasets may be labelled, categorised and filed. In the current configuration, personal geodatabases, their contents and the folders in which they reside are tagged according to specific metadata vocabulary terms by which they are unambiguously characterised, facilitating dataset management while contributing to automated metadata record compilation.

Table 1
Qualified Dublin Core element set used in the current work

| Core element name | Element refinement | Description |
| --- | --- | --- |
| Title | – | Title |
| | Alternative | Alternative title |
| Description | Abstract | A brief narrative summary |
| Language | | Language |
| Subject | Keywords | Main theme(s) |
| Date | Created | Date of creation |
| | Modified | Last date of update |
| | Period.name | Name of a specific interval. Used here to define frequency of update |
| Creator | | Originating person/organisation |
| Publisher | | Distributing person/organisation |
| Contributor | | Contributing person/organisation |
| Format | | Digital manifestation of resource |
| Type | Dataset | Nature of content |
| Rights | Access Rights | Access restrictions |
| Coverage | Spatial.Box.name | Name of geographic extent of dataset |
| | Spatial.Box.projection | Spatial reference system of dataset |
| | Spatial.Box.northlimit | Limits of dataset extent in coordinates |
| | Spatial.Box.eastlimit | Limits of dataset extent in coordinates |
| | Spatial.Box.southlimit | Limits of dataset extent in coordinates |
| | Spatial.Box.westlimit | Limits of dataset extent in coordinates |
| Identifier | | Online linkage to dataset |
| Relation | | A reference to a related resource |
| Source | | A reference to a resource from which the present resource is derived |

Fifteen core elements are qualified by element refinements resulting in a total of 23 fields.

Table 2
Prototype folder hierarchy in which datasets are filed and named

| Container | Name | ISO Code List |
| --- | --- | --- |
| Primary tier | Date Period | 19115:MD_MaintenanceFrequencyCode |
| Secondary tier | Access Rights | 19115:MD_RestrictionCode |
| Personal geodatabase | Subject Keyword | 19115:MD_TopicCategoryCode |
| Feature dataset | Coverage Spatial Box Name | 3166-2 |
| Feature class | Subject Keyword | 19115:MD_TopicCategoryCode |

Entire code lists need not be replicated within the hierarchy: containers may be created as required on filing new datasets.

Fig. 2. Illustration of prototype data storage hierarchy, yielding five metadata elements. Container tags are assigned on the basis of specific metadata entity code lists detailed in Table 2.

The test scenario comprised a three-tiered folder hierarchy: a root directory, a primary tier and a secondary tier. Each tier beneath the root denotes a specific metadata element, holding containers labelled using code lists of commonly used ISO standards (Table 2). Personal geodatabases are similarly tagged and stored in a location within the hierarchy which best reflects the attributes of the data within. Personal geodatabase constituents are likewise managed, with appropriate code lists defining how collections of geographic features (feature classes) and their aggregations (feature datasets) are annotated. The test hierarchy is illustrated in Fig. 2.

The test hierarchy illustrates how the elements of the adopted metadata standard may be used to coordinate dataset storage, and later contribute to metadata record creation. The choice of element for a particular tier will depend on the application domain; Date Period for instance was chosen as primary in the prototype hierarchy to reflect a scenario whereby datasets are organised and filed first by the frequency they need to be updated. Folders (containers) residing on this tier are tagged using the domain codes and names (defined in the ISO 19115:MD_MaintenanceFrequencyCode code list) appropriate for the data to be contained therein. Subsequent tiers and geodatabase objects are tagged in a similar fashion. For metadata contribution, extraction routines read each tag in a given dataset’s path; domain code and name pairs may be used in combination or may be parsed as needed.

¹³ ISO 19109:2005 Geographic Information—Rules for Application Schema standard for instance allows for the definition of conceptual data models which define the logical structure of an application’s data, commonly instantiated by feature types defined in feature catalogues.

The approach represents a plausible data management protocol which may be readily adapted to different application domains. The entire hierarchy or geodatabase configuration need not be recreated for effective contribution to generating metadata; indeed, a subset of each code list may be preferred, with containers created only when needed to organise incoming datasets. The contribution of this storage classification strategy is clearly bound to the number of tiers in the combined folder and geodatabase hierarchy. Currently, three simple elements (Date Period, Access Rights and Spatial Box Name) and one compound element (depicting two Subject Keywords) are derived.
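The path-based derivation described above may be sketched as follows. This is a minimal illustration rather than the prototype’s own routine: the tag convention (`<domain code>_<domain name>`), the binding of Access Rights to the secondary tier, the folder names and the function name `elements_from_path` are all assumptions made for the example.

```python
from pathlib import PureWindowsPath

# Hypothetical tier-to-element binding. The paper binds Date Period to the
# primary tier; the secondary-tier binding shown here is an assumption.
TIER_ELEMENTS = ["Date Period", "Access Rights"]
TAG_SEPARATOR = "_"  # assumed tag convention: "<domain code>_<domain name>"


def elements_from_path(dataset_path, root):
    """Map each container tag between the root and the dataset to an element."""
    relative = PureWindowsPath(dataset_path).relative_to(PureWindowsPath(root))
    containers = relative.parts[:-1]  # drop the dataset (geodatabase) itself
    record = {}
    for element, tag in zip(TIER_ELEMENTS, containers):
        # Split the tag into its domain code and domain name.
        code, _, name = tag.partition(TAG_SEPARATOR)
        record[element] = {"code": code, "name": name}
    return record


# Illustrative folder tags only; they are not taken from the paper.
print(elements_from_path(r"C:\data\008_annually\005_license\roads.mdb", r"C:\data"))
```

Because each tier contributes exactly one element, the number of elements recovered this way grows with the depth of the hierarchy, which is the trade-off against navigability noted in the text.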

The method for extracting metadata elements from a dataset meanwhile presupposes that they have been comprehensively compiled. Additional dataset properties, whether supplied by the author or calculated in the process of registering geographic objects, present the opportunity for extending what can be extracted from a data store. Held alongside the data in a similar manner to spatial referencing information such as projection and coordinate system, details may be mined using custom code and transformed into usable metadata elements. Currently a single element, Alternative (title), is created using this extraction method; scope remains to retrieve other elements not currently treated formally as metadata by the program for more detailed standards (including for instance spatial resolution and certain vertical extent attributes).

The routines which extract elements from the data may depend upon a formal attribute schema, or may allow for a relative lack of structure. Indexing frequently occurring textual attributes may for instance serve to extract values which can be adopted as keyword elements; retrieving feature type definitions may nevertheless suffice if a feature catalogue-based schema¹³ is adhered to. Date fields may similarly be queried indiscriminately to yield potential maximum and minimum values for a date range of use element, or they may be referenced directly in the event that a predictable data standard is employed. In the present work, a Revision Date element is derived from the contents of a specific field denoting the date of update of each individual feature.
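The two extraction styles contrasted above, direct reference to a known field versus indiscriminate querying and indexing, can be illustrated with a short sketch. Features are represented here as plain dictionaries; the field name `last_updated`, the choice of the latest update as the Revision Date, and all function names are assumptions for illustration, not details taken from the prototype.

```python
from collections import Counter
from datetime import date


def revision_date(features, field="last_updated"):
    """Schema-based: read one known date field and return its latest value."""
    dates = [f[field] for f in features if f.get(field) is not None]
    return max(dates) if dates else None


def date_range(features):
    """Schema-free: scan every attribute and return (min, max) of all dates.

    Assumes all date values share one type (date, not mixed with datetime).
    """
    dates = [v for f in features for v in f.values() if isinstance(v, date)]
    return (min(dates), max(dates)) if dates else None


def candidate_keywords(features, top=3):
    """Index textual attribute values and return the most frequent ones."""
    counts = Counter(
        v.strip().lower()
        for f in features
        for v in f.values()
        if isinstance(v, str) and v.strip()
    )
    return [word for word, _ in counts.most_common(top)]
```

Values returned by `candidate_keywords` would still need review by a human mediator before being adopted as keyword elements.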

3.4. Template harvesting

In a method adopted by many geospatial metadata editors, reusable content may be stored in XML files for harvesting during metadata production. The approach is extended herein through the association of variables managed by the underlying operating system with these pre-prepared templates. Details of a dataset’s Creator, Publisher or Contributor for the current Dublin Core profile can for instance be automatically incorporated within each metadata record on the basis of variables such as the current username or the domain of the user’s workstation. XML template constituents are again addressed using XPath expressions and retrieved in a similar manner to the method for inherent metadata above.
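A minimal sketch of this kind of template harvesting follows, using the limited XPath support in Python’s standard `xml.etree.ElementTree` module. The template layout, its element names, the `default` fallback entry and the function name `harvest` are hypothetical; the paper does not describe the structure of the prototype’s templates.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: reusable contact details keyed by login name.
TEMPLATE = """
<template>
  <user name="jsmith">
    <creator>J. Smith</creator>
    <publisher>Example Mapping Unit</publisher>
  </user>
  <user name="default">
    <creator>Unknown</creator>
    <publisher>Example Mapping Unit</publisher>
  </user>
</template>
"""


def harvest(template_xml, username, elements=("creator", "publisher")):
    """Select the template entry for a user and return the requested elements."""
    root = ET.fromstring(template_xml)
    # XPath predicate on the name attribute; fall back to a default entry.
    node = root.find(f"./user[@name='{username}']")
    if node is None:
        node = root.find("./user[@name='default']")
    return {element: node.findtext(element) for element in elements}
```

In use, the username would come from the operating system, for example `harvest(TEMPLATE, getpass.getuser())` with the standard `getpass` module, so that no contact details need to be typed for each record.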

3.5. Metadata editing interface

The tool centres on a form interface through which routines are initiated and metadata items are edited, with the form’s fields corresponding to the content of the chosen metadata standard (Fig. 3). Inherent metadata elements retrieved through initial harvesting are used to pre-populate the form, such elements being the most up-to-date and typically not edited. Extraction and harvesting routines can either be performed during initial pre-population, or manually executed once the form has loaded on-screen. Here, standard elements (for which metadata generating procedures may be applied) are selected either in their entirety or in any specified combination prior to initiation. Metadata items retrieved may be edited and additional ones supplied in an interface complete with content guiding mechanisms similar to those of the stand-alone editors mentioned previously.


Fig. 3. Prototype’s central interface. Routines may be run selectively via the Generate/Output menu items or collectively via the Run All button. Routines and elements may also be selected/deselected (via the Configure menu), enabling mediator customisation of the Run All operation set.


3.6. Record validation

The method for validating completed metadata records will depend upon the application for which they are intended. Records destined for publication on a geospatial data clearinghouse service may demand strict quality control to ensure both that the XML output is well-formed and that the content values are within allowable ranges; applications with less stringent quality demands may simply require rudimentary content validation. XML Schema validation in the first instance may be facilitated with the support of the Microsoft XML Core Services (MSXML) which accompany the .NET platform. Metadata records produced to meet a specific standard are output to XML where they may be tested for compliance (using methods provided by MSXML) against a corresponding XSD schema registered with the tool. As an alternative to schema validation, metadata editor fields may be verified prior to export to file using integrated spell-checking, domain value look-ups and other integrity measures such as verification of mandatory field completion.
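The lighter alternative to schema validation, checking editor fields before export, might look like the sketch below. The set of mandatory fields, the abbreviated domain list and the function name `verify` are assumptions for illustration; spell-checking and XSD validation are not covered.

```python
# Assumed subset of mandatory fields; a real profile would define its own.
MANDATORY = {"Title", "Creator", "Date Period"}

# Assumed, abbreviated domain list for one field (names in the style of
# ISO 19115 MD_MaintenanceFrequencyCode).
DOMAINS = {"Date Period": {"daily", "weekly", "monthly", "annually", "asNeeded"}}


def verify(record):
    """Return a list of problems found in a record; an empty list means it passed."""
    problems = []
    # Mandatory field completion.
    for field in sorted(MANDATORY):
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"{field}: mandatory field is empty")
    # Domain value look-up for fields restricted to a code list.
    for field, allowed in DOMAINS.items():
        value = record.get(field)
        if value and value not in allowed:
            problems.append(f"{field}: '{value}' is not in the domain list")
    return problems
```

Returning a list of messages rather than raising on the first failure lets an editing form flag every offending field in a single pass.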

3.7. Metadata output

One of the key components of the tool is the standard mapping or crosswalk file. Not only used to support the aforementioned metadata harvesting techniques, the file also provides a means of cross-referencing metadata standard elements. Each row in the mapping table denotes a metadata element; columns denote each specified metadata standard. Field values are in the form of XPath references which are used to read from and write to XML metadata. Once the metadata editing form is complete, elements can be written back to the ArcCatalog-native metadata for association with the active dataset or exported to external XML files according to the standard(s) included in the crosswalk file.
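The crosswalk idea can be sketched as a small table of path references that is used for both reading and writing. The two-column table, the simplified paths (which are not real Dublin Core or ISO 19139 XPaths) and the function names are hypothetical; only simple child-step paths are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical crosswalk: one row per element, one column per standard.
# The paths are simplified placeholders, not real standard-conformant XPaths.
CROSSWALK = {
    "Title":   {"dc": "title",   "iso": "identificationInfo/citation/title"},
    "Creator": {"dc": "creator", "iso": "contact/individualName"},
}


def read_elements(root, standard):
    """Read every mapped element present in a record of the given standard."""
    values = {}
    for element, paths in CROSSWALK.items():
        text = root.findtext(paths[standard])
        if text is not None:
            values[element] = text
    return values


def write_elements(values, standard, root_tag="metadata"):
    """Build a record in the target standard, creating path steps as needed."""
    root = ET.Element(root_tag)
    for element, text in values.items():
        node = root
        for step in CROSSWALK[element][standard].split("/"):
            child = node.find(step)
            node = child if child is not None else ET.SubElement(node, step)
        node.text = text
    return root


source = ET.fromstring("<metadata><title>Roads</title><creator>J. Smith</creator></metadata>")
target = write_elements(read_elements(source, "dc"), "iso")
print(ET.tostring(target, encoding="unicode"))
```

Keeping the mapping in data rather than in code means that supporting a further standard amounts to adding a column to the table.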

4. Results

The prototype offers a reasonable saving in effort for the metadata producer, albeit measured in the number of elements automatically populated rather than in actual time saved. Of the total 23 metadata standard elements outlined in Table 1 above, 20 were automatically generated using the prototype method. Abstract fields typically demand some degree of intellectual forethought as to their contents; no direct attempt was therefore made to address this element. Other metadata items such as Keywords could nevertheless be used to seed the Abstract entry and guide the manual composition of said description. Relation and Source elements were similarly omitted as the test data was neither derived from nor necessarily related to other existing resources; such elements could however be derived from other feature classes within the same feature dataset (Relation) or from the collection of lineage information if specifically supported (Source).

The impact of a dataset’s physical location on element extraction is bound to the value of n, the number of contributing levels in the adopted folder hierarchy. Increasing values of n will increase metadata element contribution but potentially complicate dataset classification and storage; the n value adopted should offer a compromise between maximising element contribution and the need to retain navigability of the storage hierarchy. A value of two was deemed appropriate for the current investigation; arguments for exceeding this in any future implementation should be carefully considered.

Feature datasets, employed to enable topological operations and impose a further degree of data organisation, must contain feature classes falling within the same geographic region and were therefore considered ideal for binding with elements depicting spatial coverage. While no similar match presented itself for feature classes, the use of the Keyword element was deemed a reasonable fit in the present context. Extending the involvement of hierarchical geodatabase components to metadata generation beyond that made by feature datasets and feature classes was not pursued. Nevertheless, internal aggregations of features supported by the geodatabase’s object-relational model could conceivably be leveraged to contribute, and this possibility is consequently noted.

Initial harvesting of inherent metadata and the extraction of data properties meanwhile do demand extra effort and diligence when it comes to data preparation, but offer the additional benefit of enhancing data quality. Whilst the quality of inherent metadata by and large depends upon appropriate dataset initialisation (registering the correct projection details for example), data property extraction relies on the completion of dataset variables which may or may not be required within an organisation’s application domain. Any decision to cater for such variables will depend on whether it is intended to use metadata to market data resources and whether the adopted metadata standard supports an equivalent element, all the while bearing in mind that an explicit declaration of ‘no value’ for a metadata element eliminates the uncertainty that blank entries present.

Despite the success of querying and indexing techniques employed on unstructured attribute tables, it is suggested that element extraction is better facilitated if such tables conform to a standard data schema whose constituents can be consistently and reliably referenced. Here, extra implementation costs can be partially offset through the incorporation of standard-compliant data dictionaries within surveying equipment; however, querying and indexing routines may be preferred in instances where spatial data standards are deemed too unwieldy to implement. A further alternative would be the introduction of a more lightweight feature-level metadata standard on which to base extraction techniques. Advances in representing attribute semantics may well hold particular relevance here, such as the contribution of ontology-based metadata as presented by Schuurman and Leszczynski (2006).

5. Discussion

An approach for enabling the rapid production of geospatial metadata has been proposed, one which demonstrates the potential opportunities that geospatial data, their applications and management practices present when undertaking automated metadata generation. While the desire is to automatically populate the maximum number of elements per metadata standard used, the role of human mediators for the purposes of quality evaluation should not be overlooked. As the approach focuses on more than pure metadata record completion, that is, it is also predicated on good data preparation and management practices, there are more potential points of entry for errors, with the subsequent need for extra diligence during data creation.

It could be argued that what is presented here is not so much the automatic generation of metadata but the transfer of effort from metadata production to data management. While this is certainly, but not exclusively, the case, it is proposed as a sound metadata management model as it encourages a well-defined data storage scheme and good data preparation; it frees up authoring resources which may now be applied to descriptive metadata and quality control, a conspicuous benefit in cases where data documenters and data authors are distinct; it safeguards metadata quality (contingent on appropriate dataset categorisation and data preparation) as elements are retrieved rather than entered manually; and it presents a significant net saving of time despite the potentially high initial investment. Geospatial metadata has long been advocated to facilitate the management of data collections; the current approach takes this one step further, using metadata standard elements to plan data filing and, in the process, contribute to metadata production. And although the test configuration may be perceived as contrived, it is argued that data management, by definition, should adhere to a formalised, predictable structure to best facilitate data categorisation and location.

While the present tool demonstrates a functional set of options for the automatic generation of metadata, potential for extension remains. Aside from adapting it to incorporate support for the multi-user geodatabases prevalent in corporate GIS environments, further scope exists for deriving metadata elements from both the data and the host systems. Geographic extent names can be calculated from the data’s geometry by overlaying it with a reference place-name and boundary dataset bundled with the tool. Scripts which track data transformations and changes at the file level may be integrated with existing provision for monitoring data provenance to produce more detailed information on a dataset’s pedigree. For organisations participating in data exchange initiatives, the tool can be extended to export an interoperable version of its data along with its metadata. By coupling this output with existing metadata serving software and an open source mapping server, an inexpensive means of visualisation can be supported for organisations wishing to market their data holdings.

6. Conclusions

It has not been the intention to laud a specific proprietary software offering, or proprietary GIS in general, merely to present what it has been possible to achieve by extending the basic functionality one offering provides in facilitating the production of quality geospatial metadata. Whether adopted in their entirety or in piecemeal fashion, it is contended that gains are to be had in tackling metadata generation bottlenecks and data management issues with approaches based on those outlined above.

Employing ArcGIS, the coupling of metadata production with dataset workflows was enabled, as was the exploitation of existing programming frameworks for development. While open source GIS offerings have made great strides in recent times, particularly in the realm of data sharing and visualisation across the web, the lack of a mature, widely adopted production-level desktop GIS with sophisticated metadata support argued against their use. Such a shortfall may be viewed as curious considering the weight of support behind geospatial data exchange efforts dealing with standardisation, communication protocols and software, particularly considering the pivotal role metadata plays therein.

Considering the attention paid to such data sharing initiatives, one can easily get the impression that the actual generation of quality data surrogates is taken for granted; evidence from the UK’s national metadata service GIgateway would suggest that this is in fact one of the major obstacles to geospatial data exchange. Obligating and incentivising the supply of geospatial metadata will only ever work up to a point; the key is to encourage a change in mentality towards the process of documenting data. Automating metadata production can help facilitate this, and supports the call to expand the scope of GI exchange efforts to include geospatial metadata generation.

The continuing convergence around ISO standards can contribute here, providing a common way to encode and manipulate metadata, thus increasing the scope for interoperability amongst emerging metadata solutions. The aim should be the development of a generic tool, readily adaptable to any application domain, yet capable of easing the burden of metadata authoring irrespective of data strategy employed. Existing provision for interoperability in proprietary solutions using OGC-compliant strategies could be extended for this purpose through the support of standard APIs to allow full access to data stores regardless of origin. GML may present an option here; a better approach however would be to access data in its native form without the need for transformation and subsequent potential for data loss.


References

Anderson, J.D., Perez-Carballo, J., 2001. The nature of indexing: how humans and machines analyze messages and texts for retrieval: part I: research, and the nature of human indexing. Information Processing and Management: An International Journal 37 (2), 231–254.

Batcheller, J.K., Gittings, B.M., 2006. Avenues for developing the UK’s National Geospatial Metadata Service. In: Proceedings of the Geographical Information Science Research UK 14th Annual Conference, University of Nottingham, Nottingham, UK, pp. 259–262.

Batcheller, J.K., Gittings, B.M., Dowers, S., 2007. The performance of vector oriented data storage structures in ESRI’s ArcGIS. Transactions in GIS 11 (1), 47–65.

Craven, T., 2001. DESCRIPTION meta tags in public home and linked pages. LIBRES: Library and Information Science Research Electronic Journal 11 (2).

FGDC, 1998. FGDC-STD-001-1998. Content standard for digital geospatial metadata. Federal Geographic Data Committee, Reston, VA, USA, 90pp.

Foy, F.B., 2001. Metadata made easier? Development of improved online tool for “askGIraffe”. M.Sc. Thesis, University of Edinburgh, Edinburgh, 21pp.

Gobel, S., Lutze, K., 1998. Development of meta databases for geospatial data in the WWW. In: Proceedings of the Sixth ACM International Symposium on Advances in Geographic Information Systems 1998, ACM, Washington, DC, USA, pp. 94–99.

Goodchild, M.F., Haining, R.P., 2004. GIS and spatial data analysis: converging perspectives. Papers in Regional Science 83, 363–385.

Greenberg, J., 2003. Metadata generation: process, people and tools. Bulletin of the American Society for Information Science and Technology 29 (2).

Greenberg, J., 2004. Metadata extraction and harvesting: a comparison of two automatic metadata generation applications. Journal of Internet Cataloging 6 (4), 59–82.

Greenberg, J., Spurgin, K., Crystal, A., 2006. Functionalities for automatic metadata generation applications: a survey of metadata experts’ opinions. International Journal of Metadata, Semantics and Ontologies 1 (1), 3–20.

Guptill, S.G., 1999. Metadata and data catalogues. In: Longley, P., Goodchild, M.F., Maguire, D.J., Rhind, D.W. (Eds.), Geographical Information Systems. Wiley, Chichester, pp. 677–692.

ISO, 2005. BS EN ISO 19115:2005. Geographic Information - Metadata. BSi British Standards, Failand, Bristol, UK, 154pp.

ISO, 2006. BS EN ISO 19106:2006. Geographic Information - Profiles. BSi British Standards, Failand, Bristol, UK, 32pp.

Kim, T.J., 1999. Metadata for geo-spatial data sharing: a comparative analysis. The Annals of Regional Science 33, 171–181.

Leiden, K., Laughery, K.R., Keller, J., French, J., Warwick, W., Wood, S.D., 2001. A Review of Human Performance Models for the Prediction of Human Error. National Aeronautics and Space Administration, Moffett Field, CA, USA, 125pp.

Liddy, E.D., Sutton, S.A., Paik, W., Allen, E., Harwell, S., Monsour, M., Turner, A., Liddy, J., 2001. Breaking the metadata generation bottleneck: preliminary findings. In: Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, Roanoke, Virginia, 464pp.

Lim, E.-P., Liu, Z., Yin, M., Goh, D.H.-L., Theng, Y.-L., Ng, W.K., 2005. On organising and accessing geospatial and georeferenced web resources using the G-Portal System. Information Processing and Management: An International Journal 41 (5), 1277–1297.

Limbach, T., Krawczyk, A., Surowiec, G., 2004. Metadata lifecycle management with GIS context. In: Proceedings of the 10th EC GI & GIS Workshop, ESDI State of the Art, Warsaw, Poland.

Longhorn, R.A., 2005. Geospatial standards, interoperability, metadata semantics and spatial data infrastructure. In: NIEeS Workshop on Activating Metadata, Cambridge, UK, 23pp.

Mathys, T., 2004. The Go-Geo! Portal metadata initiatives. In: Proceedings of the Geographical Information Science Research UK 12th Annual Conference, University of East Anglia, Norwich, UK, pp. 148–154.

Moellering, H., 1992. Opportunities for use of the spatial data transfer standard at the state and local levels. Cartography and Geographic Information Systems 19 (5), 332–334.

Qi, L., Lingling, G., Feng, H., Yong, T., 2004. A unified metadata information management framework for digital city. In: Proceedings of IEEE’s Geoscience and Remote Sensing Symposium, Anchorage, Alaska, USA, pp. 4422–4424.

Schuurman, N., Leszczynski, A., 2006. Ontology-based metadata. Transactions in GIS 10 (5), 709–726.

Stephens, R., Hochgurtel, B., 2002. Visual Basic .NET and XML. Wiley, New York, USA, 530pp.

Taylor, M., 2004. Metadata - describing geospatial data. In: Nebert, D.D. (Ed.), Developing Spatial Data Infrastructures: The SDI Cookbook Version 2.0. The Global Spatial Data Infrastructure Association, pp. 24–38.

Tsou, M.-H., 2002. An operational metadata framework for searching, indexing, and retrieving distributed geographic information services on the Internet. In: Egenhofer, M., Mark, D. (Eds.), Lecture Notes in Computer Science, vol. 2478. Springer, Berlin, pp. 313–332.

West Jr., L.A., Hess, T.J., 2002. Metadata as a knowledge management tool: supporting intelligent agent and end user access to spatial data. Decision Support Systems 32, 247–264.