IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 51, NO. 11, NOVEMBER 2013 5073

Mapping Geospatial Metadata to Open Provenance Model

Chen-Chieh Feng

Abstract— This paper maps the data lineage entities in ISO19115 and ISO19115-2, the metadata standards of the International Organization for Standardization for geographic information and for imagery and gridded data [ISO geospatial metadata (GMD)], to the entities in the open provenance model (OPM). The term “map” refers to establishing a correspondence between the said entities in ISO GMD and OPM. Presently, many geospatial data sets available in spatial data infrastructures (SDI) are described using ISO GMD. Its structure, however, makes tracing the provenance of these data a challenging task. OPM prioritizes causal relationships between things for capturing the workflow applied to particular data, making it easier to trace the data provenance. The mapping in this paper provides a convenient means to trace the provenance of data through the OPM causal relations and to evaluate the fitness for use of these data, a necessary step toward data integration. This paper uses the notion of process to identify the various data processing activities encoded in ISO GMD, the resource and agent types involved in these activities, and state changes. A software prototype that carries out the mapping is developed. The mapping result is encoded in the resource description framework format to permit integral use of geospatial data in SDI and data from the open data world. An exemplar metadata record in ISO GMD from the National Oceanographic Data Center of the National Oceanic and Atmospheric Administration is used to demonstrate the feasibility of converting the ISO GMD data lineage entities to the OPM entities.

Index Terms— Mapping, metadata, open provenance model (OPM).

Manuscript received September 30, 2012; revised January 30, 2013; accepted March 1, 2013. Date of publication April 17, 2013; date of current version October 24, 2013. This work was supported in part by the Singapore National Research Foundation through the Singapore-MIT Alliance for Research and Technology Center for Environmental Sensing and Monitoring Sub-Award 69 under Grant R-109-000-142-592.

The author is with the Department of Geography, National University of Singapore, 117570 Singapore (e-mail: [email protected]).

Digital Object Identifier 10.1109/TGRS.2013.2252181

0196-2892 © 2013 IEEE

I. INTRODUCTION

GEOGRAPHIC information metadata are an important source from which geospatial data users acquire the organizational, process, or knowledge information about the data [1], [2]. Many metadata found in major spatial data infrastructures (SDI) are encoded in the metadata schema of ISO19115 [3] and ISO19115-2 [4], the International Organization for Standardization standard for geographic information metadata and its extension for imagery and gridded data. For brevity, these two metadata standards are collectively termed ISO geospatial metadata (GMD). Attributes in ISO GMD, which overlap with at least 14 of the 18 concepts for measuring information quality on the semantic web [5], including those related to spatial and temporal information, responsible parties, lineage information, various data quality and conformance test results, and usage information, are especially useful for evaluating the quality of spatial data in cyberinfrastructure. As a growing number of geographic information metadata records are offered in ISO GMD, they have become an important source for evaluating the fitness for use of the data.

Data provenance overlaps extensively with geographic information metadata in the kind of information about the data they provide, but it emphasizes data lineage and the possibility to track, report, reproduce, and roll back results for verifying the authenticity of the data [6]–[8]. More importantly, if one treats environmental simulation models as knowledge systems that provide the best knowledge of the environment to date, the ability to gain a deeper understanding of the inner workings of the environmental dynamics using the model’s data provenance is critical [9]–[11]. As these aspects of data quality rely heavily on capturing causal dependencies between various things, data provenance is better encoded using causal graphs than the attribution metadata style of the traditional geographic information metadata content standards. The underlying structure of the open provenance model (OPM), a model designed to capture data provenance, is based on causal graphs.

Causality-based data provenance is also significant in web processing applications, specifically in linked open data applications [12], where data are encoded in a way that enables software agents to understand the meaning of web content and process information on the web more accurately [13]. In such applications, the resource description framework (RDF) format is typically used to encode data and data semantics. The format lends itself to representing causal graphs and is supported by SPARQL, an RDF query language recommended by the World Wide Web Consortium [14], for identifying data based on the data semantics encoded in RDF. In highly publicized cases such as the emergency response to the earthquake in Haiti [15], data in RDF (e.g., user-generated or in-situ sensor data) are often linked with data traditionally provided in SDI and described by the ISO GMD, and then processed in a series of chained web-based geoprocessing services that are specified as workflows [16], [17]. For these cases involving multiple web-based geoprocessing services and input data of various levels of quality, the ability to integrate data provenance from SDI and the open data world is critical, as the quality of the result is conditioned by the output format as well as the completeness and precision of the metadata provided with the data.


Inspired by the research in [18], where elements of an attribution metadata standard, Dublin Core (DC), are mapped to OPM to leverage the strengths of both attribution metadata and causality-based data provenance, this paper aims to map the data lineage entities in ISO GMD to the entities in OPM, which provide primitives for encoding causal dependencies between entities. The term “map” refers to establishing a correspondence between two parties. The ISO GMD overlaps with DC in terms of metadata elements, and thus the research in [18] presents an ideal starting point for this paper. However, the two standards differ in at least two ways, so a direct adoption is not possible. First, the ISO GMD is for geospatial applications where the resources involved are typically digital data undergoing multiple transformations that alter the spatial properties or the attribute values of the data. DC is intended more for applications involving document-like resources. How the identity of a digital geospatial resource may change is more complex than for a document-like resource because, in geospatial resource processing, the output resource can retain the identity of the sole input resource or of one of the input resources. In some cases, the output resource carries a brand-new identity. Second, the mixture of compulsory and optional metadata elements in ISO GMD further complicates this issue, especially when an optional and a compulsory metadata element offer similar information on a resource’s identity or, sometimes, its version. A careful examination of these issues is needed before mapping the ISO GMD entities to the OPM entities, thereby allowing the evaluation of fitness for use in applications that require data from both the SDI and the open data world.

This paper is structured as follows. Section II describes OPM and ISO GMD. Section III describes the mapping from the ISO GMD entities to the OPM entities. Section IV uses metadata from the National Oceanographic Data Center (NODC) of the National Oceanic and Atmospheric Administration (NOAA) as an example to demonstrate the usefulness of the mapping. The last section concludes and presents future directions.

II. PROVENANCE AND METADATA STANDARDS

Here, the OPM and ISO GMD are briefly described. The description of OPM focuses on the causal dependency between its primitives and thereby its semantics. The full OPM specification was presented in [19]. The description of the ISO GMD focuses on state change, causal relation, and identity.

A. OPM

OPM is a provenance model developed through a series of provenance challenges and discussions by its participants (http://openprovenance.org/model/opmo lists the contributors to the OPM specification). It characterizes the causal relations between three primitives, Artifact, Process, and Agent, and how instances of Artifact reach specific states. The model deals with well-defined past events. An Artifact represents an immutable piece of state. A Process refers to one or more actions that are either performed on existing Artifact instances or give rise to new ones. An Agent controls a Process [19]. The three primitives are intentionally general to enable provenance interoperability [20].

Fig. 1. OPM data model [19]. R is causal relation.

Five causal relationships are defined in OPM (Fig. 1). A Process instance can hold the Used relation with an Artifact instance, whereas in the reverse direction the wasGeneratedBy relation holds. The two relations are interpreted as: 1) a Process used an Artifact, and 2) an Artifact was generated by a Process, respectively. A Process instance can hold the wasControlledBy relation with an Agent instance. Roles that are meaningful to a particular domain [represented by (R) in Fig. 1] can be specified for the first three relations to indicate the context under which these relations hold. The last two causal relationships are between two instances of Process and between two instances of Artifact. The relation wasTriggeredBy holds between two Process instances, whereas the relation wasDerivedFrom holds between two Artifact instances. The former may be a simplified version of a chained Used and wasGeneratedBy causal relationship, whereas the latter implies a chained wasGeneratedBy and Used causal relation. These relations can be time-stamped and chained through the three primitives to fully describe the particular workflow that changes the state of geospatial data.

Each primitive is represented by a node and each causal relation by a directed edge that connects two nodes. For example, in Fig. 1 the Artifact and Process are two nodes connected by the directed edges Used and wasGeneratedBy. Formally, nodes and edges in OPM are annotatable, i.e., property-value pairs indicating label, persistent name, profile, and type can be attached to them. Custom objects and relations can thus be defined for particular cases.
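The node-and-edge structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Node` and `Edge`, the `ALLOWED` table, and the example node names are hypothetical and are not part of the OPM specification or of the paper's prototype.

```python
from dataclasses import dataclass, field
from typing import Optional

# (effect kind, cause kind) permitted for each of the five OPM causal relations.
ALLOWED = {
    "used": ("Process", "Artifact"),
    "wasGeneratedBy": ("Artifact", "Process"),
    "wasControlledBy": ("Process", "Agent"),
    "wasTriggeredBy": ("Process", "Process"),
    "wasDerivedFrom": ("Artifact", "Artifact"),
}
# Only the first three relations carry a domain-specific role (R in Fig. 1).
ROLE_BEARING = {"used", "wasGeneratedBy", "wasControlledBy"}


@dataclass(frozen=True)
class Node:
    kind: str  # "Artifact", "Process", or "Agent"
    name: str


@dataclass
class Edge:
    relation: str
    effect: Node  # node the directed edge starts from
    cause: Node   # node the directed edge points to
    role: Optional[str] = None
    annotations: dict = field(default_factory=dict)  # property-value pairs

    def __post_init__(self):
        if self.relation not in ALLOWED:
            raise ValueError(f"unknown relation: {self.relation}")
        if (self.effect.kind, self.cause.kind) != ALLOWED[self.relation]:
            raise ValueError(
                f"{self.relation} cannot link {self.effect.kind} to {self.cause.kind}"
            )
        if self.role is not None and self.relation not in ROLE_BEARING:
            raise ValueError(f"{self.relation} does not take a role")


# Hypothetical example: a process step that reads one artifact and writes another.
dem = Node("Artifact", "dem")
basins = Node("Artifact", "basins")
step = Node("Process", "terrain_analysis")
graph = [
    Edge("used", step, dem, role="source"),
    Edge("wasGeneratedBy", basins, step, role="output"),
    Edge("wasDerivedFrom", basins, dem, annotations={"label": "derived"}),
]
```

Validating the node kinds at construction time keeps an invalid edge, such as a Process controlled by an Artifact, from entering the graph.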

There are numerous applications of OPM in the geoscience domain. For example, it served as the basis of a multi-tier model for capturing data provenance in global climatic research [21] and for capturing, creating, and publishing information about geoscience data [22] as well as virtual sensor systems [23]. OPM has also been used to improve data interoperability of the nested relational calculus model for workflow repositories [24].

TABLE I
ROLES OF BINARY RELATIONS IN LINEAGE SECTION OF ISO GMD [4] PERTAINING TO DATA PROVENANCE

| Relation | Meaning |
| --- | --- |
| Source | Information about the source data used in creating the data specified by the scope. |
| Output | Description of the product generated as a result of the process step. |
| Algorithm | Details of the methodology by which geographic information is derived from the instrument readings. |

The interest in OPM, and in provenance on the Web generally, led to a chartered working group in W3C to define a provenance exchange language (http://www.w3.org/TR/prov-primer/) called PROV. Both OPM and PROV are high-level data models and are similar in the sense that each of the three OPM primitives finds its counterpart primitive in PROV, i.e., Artifact (OPM) and Entity (PROV), Process (OPM) and Activity (PROV), and Agent (OPM) and Agent (PROV). Causal relationships in OPM and PROV are also rather similar because each of the five OPM causal relations finds its counterpart relation in PROV. The only difference between the two provenance models lies in the two additional causal relationships formally defined in PROV, i.e., wasAttributedTo between Entity and Agent, as well as actedOnBehalfOf between two Agents (http://www.w3.org/TR/prov-o/diagrams/starting-points.svg).
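The OPM-to-PROV correspondence described above can be written as a small lookup table. The pairing of the primitives follows the text. The pairing of the relations is an assumption made for this sketch: `wasAssociatedWith` and `wasInformedBy` are PROV-O starting-point terms, but the text does not state which PROV relation corresponds to which OPM relation. The function name `to_prov` and the example triples are hypothetical.

```python
# OPM term -> assumed PROV-O counterpart.
OPM_TO_PROV = {
    # Primitives (pairing given in the text).
    "Artifact": "Entity",
    "Process": "Activity",
    "Agent": "Agent",
    # Causal relations (pairing assumed for this sketch).
    "used": "used",
    "wasGeneratedBy": "wasGeneratedBy",
    "wasControlledBy": "wasAssociatedWith",
    "wasTriggeredBy": "wasInformedBy",
    "wasDerivedFrom": "wasDerivedFrom",
}

# Relations that PROV defines and OPM lacks, as named in the text.
PROV_ONLY_RELATIONS = ("wasAttributedTo", "actedOnBehalfOf")


def to_prov(triples):
    """Rename the relation of each (subject, relation, object) triple."""
    return [(s, OPM_TO_PROV[rel], o) for s, rel, o in triples]


# Hypothetical OPM statements about one process step.
opm = [("step1", "used", "dem"), ("step1", "wasControlledBy", "analyst")]
prov = to_prov(opm)
```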

B. ISO GMD

The ISO standards for geographic information metadata are composed of a number of standards. The base standard of the whole collection is the ISO19115 Geographic Information–Metadata [3]. It provides the conceptual model for specifying the metadata of a geographic data set or an aggregate of data sets, and is widely used. ISO19115 has an extension for handling metadata elements for raster imagery and gridded data [4], but the extension is less used [25]. The research presented in this paper applies to both ISO19115 and ISO19115-2 to maximize its applicability. As in Section I, ISO GMD is used throughout the paper to refer to these two ISO metadata standards.

Many ISO GMD entities provide direct or indirect information on the processes involved in creating or affecting geospatial data. The lineage section under the data quality information presents the most direct source of information for data provenance. When fully specified, it describes individual process steps along with their input and output, when and by whom these processes were executed, and, for each process, the software, run-time parameters, and algorithms used, if applicable. Using the terminology of the unified modeling language [26], these pieces of information can be found in the ISO GMD from the roles associated with binary relations (Table I) and the properties of classes (Table II). The data provenance in the lineage section thus provides both high-level provenance, which describes the processing steps conceptually, and low-level provenance, which describes implementation details such as the software and the parameters used, and is fine-grained according to [27].

In addition to the lineage section, the citation information (CI_Citation) of a data set provides an additional source of data provenance information. The date type (CI_Date.dateType), the role of the responsible party (CI_ResponsibleParty.role) associated with the CI_Citation, the operation status of the acquisition details section (MI_Operation.status), and the edition of the CI_Citation (CI_Citation.edition) all reference data values that imply state change of geospatial data (see Table III for the codes for the first three ISO GMD entities, respectively). For example, the date type code creation gives rise to a new data set, whereas the date type code revision results in modifications of existing data content. Permitting these codes, specifically revision, implies that ISO GMD generally treats geospatial data as mutable entities, i.e., an entity can be modified while retaining its identity after it is created. The codes for the operation status (MD_ProgressCode) provide additional information on state changes. Four codes of the operation status are not considered because they represent future events (i.e., planned) that cannot be handled by OPM [28] or provide insufficient information on data provenance (i.e., ongoing, under development, and required).

TABLE II
PROPERTIES OF CLASSES IN LINEAGE SECTION OF ISO GMD [4] PERTAINING TO DATA PROVENANCE

| Property (Class) | Meaning |
| --- | --- |
| Processor (LI_ProcessStep) | Identification of, and means of communication with, person(s) and organization(s) associated with the process step. |
| SourceCitation (LI_Source) | Recommended reference to be used for the source data. |
| SoftwareReference (LE_Processing) | Reference to document describing processing software. |
| RunTimeParameters (LE_Processing) | Parameters to control the processing operations, entered at run time. |
| Citation (LE_Algorithm) | Information identifying the algorithm and version or date. |

TABLE III
CODE LISTS IMPLYING STATE CHANGES

| Code Type | Code List |
| --- | --- |
| CI_DateTypeCode | Creation, publication, revision. |
| CI_RoleCode (for responsible party) | resourceProvider, owner, user, distributor, originator, principalInvestigator, processor, publisher, author. |
| MD_ProgressCode (for operation status) | Completed, historicalArchive, obsolete. |
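The handling of the state-change codes can be illustrated with a short lookup. This is a sketch, not the paper's prototype: the function name `classify_progress` and the result labels are hypothetical, and the code spellings are copied from the text, so they may differ from the exact spellings in the ISO code lists.

```python
# Effect of the two date type codes characterized in the text.
# "publication" is omitted because the passage does not describe its effect.
DATE_TYPE_EFFECT = {
    "creation": "new data set",
    "revision": "modified existing data set",
}

# Operation status codes treated as state changes (Table III).
USABLE_PROGRESS = {"completed", "historicalArchive", "obsolete"}

# Operation status codes that are not considered, with the stated reason.
EXCLUDED_PROGRESS = {
    "planned": "future event that OPM cannot handle",
    "ongoing": "insufficient provenance information",
    "under development": "insufficient provenance information",
    "required": "insufficient provenance information",
}


def classify_progress(code):
    """Return (decision, reason) for one operation status code."""
    if code in USABLE_PROGRESS:
        return ("state change", None)
    if code in EXCLUDED_PROGRESS:
        return ("not considered", EXCLUDED_PROGRESS[code])
    raise ValueError(f"unknown operation status code: {code}")


decisions = {c: classify_progress(c)[0] for c in ("completed", "planned")}
```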

Mutable entities in ISO GMD can be identified in various ways (Fig. 2). At the most fundamental level, a mutable entity can be identified through the identifier associated with the citation information (CI_Citation.identifier), as it uniquely identifies an instance of a data source [29]. The identifier can be used jointly with other metadata entities, specifically the temporal extent (EX_TemporalExtent), to distinguish data collected in a repetitive data collection effort. Different versions of a mutable entity can be formally captured as editions (CI_Citation.edition). Sometimes metadata entities other than the identifier are used for identifying data sources. For example, the U.S. Geoscience Information Network suggested using the alternate title of the data source (CI_Citation.alternateTitle) to accommodate customized information [30]. The identifier can also be associated with a set of resources (aggregateDataSetIdentifier of MD_AggregateInformation) that are related in ways specified in the association type codes (DS_AssociationTypeCode).

The discussions here and in Section II-A show clearly the similarity between ISO GMD and OPM in terms of how data provenance is described. Both provide means to specify the processes involved in generating a data set or changing the state of a data set. In terms of describing the detail of these processes, they complement each other. OPM is domain-independent. It stays on a conceptual level, focusing only on the abstract entities Agent, Artifact, and Process, as well as the permissible causal relations between these abstract entities. ISO GMD is domain-specific. It provides entities and relations for describing data provenance for geospatial applications and is widely used in existing SDI. The mapping of the data provenance information in ISO GMD to OPM thus offers other domains that require spatial data the opportunity to access data provenance in SDI.

III. MAPPING ISO GMD ENTITIES TO OPM ENTITIES

Here, the mapping of the ISO GMD data lineage entities to OPM entities is proposed. The base mapping pattern is established first, followed by four types of refinements. As described in Section II, ISO GMD differs from OPM in that the former accepts mutable entities whereas the latter does not. To retain multiple versions of the same entity, and thus the same identity, mapping a mutable entity from ISO GMD to OPM implies using multiple OPM nodes to represent multiple instances of the same entity that are of different editions or status, or are processed at different stages.
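The idea of representing one mutable ISO GMD entity with several immutable OPM nodes can be sketched as follows. The function name `to_opm_artifacts`, the dictionary keys, and the identifier and edition values are hypothetical and serve only to show that every node shares one identity while carrying its own version index.

```python
def to_opm_artifacts(identifier, editions):
    """Return one Artifact node per recorded edition of a mutable entity.

    Every node keeps the same identity; the version index tells the
    instances apart.
    """
    return [
        {"kind": "Artifact", "identity": identifier, "version": v, "edition": ed}
        for v, ed in enumerate(editions)
    ]


# Hypothetical data set with two recorded editions.
nodes = to_opm_artifacts("dataset-A", ["1.0", "1.1"])
```

Keeping identity and version as separate fields mirrors the two indexes used for artifacts in the base mapping pattern.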

A. Base Mapping Pattern

The provenance information regarding data processing history is mainly encoded in the data lineage section, in which the processing step (LE_ProcessStep) provides the core lineage information. A process step can be associated with the roles (Table I), properties (Table II), and code lists explicitly referring to state changes of the entities of interest (Table III).

With respect to roles, the source and the output associated with LE_ProcessStep point to a process that produces or alters one or more data sets. The two roles alone identify the input–output relation between data sets. However, because it is not compulsory to specify the citation information (CI_Citation) of the two data sets, determining whether the process step results in new data sets or a revised existing data set sometimes requires consulting the free-text description associated with the process step (LE_ProcessStep.description). The property processor represents an agent controlling the process step. The role Algorithm points to an artifact used to realize the process. Together with the property softwareReference, which points to the artifact representing the software packages used for the process, and a second artifact that holds the run-time parameters, it forms a list of entities for tracing the low-level provenance of the data.

Fig. 3 shows the graphical representation of the base mapping pattern for the ISO GMD lineage section to OPM. For clarity, the ISO GMD entities are explicitly labeled (ISO GMD). The OPM entities are not labeled, as the figure shows the result of the mapping effort. Identical information is also shown in Table IV. Input source (the left LE_Source), output source (the right LE_Source), algorithm used (LE_Algorithm), and the processors executing the process step (processor) are indexed with two nonnegative integers. The first index represents identity and the second index represents version. The artifacts runTimeParameters and Software do not have these indexes because their cardinalities are either zero or one. The roles source and output become the roles (represented by R in Fig. 3) of the causal relations Used and wasGeneratedBy, respectively. They indicate the contexts in which the two causal relations are valid. LE_Algorithm is referenced by the Software, with the role Algorithm being the property name. A generic OPM edge with the property type is used to associate softwareReference and the algorithm used in the process because the only causal relation in OPM associating two Artifacts (i.e., wasDerivedFrom) does not apply in this case.

TABLE IV
BASE MAPPING FROM ISO GMD ELEMENTS TO OPM PRIMITIVES, ROLES OF RELATIONS, AND PROPERTY VALUES

| ISO GMD Name | OPM Entity | OPM Role of a Relation or Property Value |
| --- | --- | --- |
| Processor | Agent | wasControlledBy |
| LE_ProcessStep | Process | |
| LE_Source | Artifact | |
| Run Time Parameter | Artifact | |
| Software Reference | Artifact | |
| LE_Algorithm | Artifact | |
| Parameter | | Used |
| Output | | wasGeneratedBy |
| Source | | Used |
| Processing | | wasTriggeredBy |
| Algorithm | | type |
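The base mapping pattern can be sketched as a function that turns one process step into OPM edges. This is a minimal illustration under stated assumptions, not the paper's software prototype: the input dictionary layout, the key names, the `"parameter"` role label, and the `"edge"` label for the generic, non-causal edge are all hypothetical. The link between the process and the software artifact is left out because the text here does not spell it out.

```python
def map_process_step(step):
    """Return OPM edges as (effect, relation, cause, properties) tuples."""
    process = ("Process", step["id"])
    edges = []
    # processor -> Agent that controls the process
    for agent in step.get("processors", []):
        edges.append((process, "wasControlledBy", ("Agent", agent), {}))
    # source -> Artifact used by the process, with role "source"
    for src in step.get("sources", []):
        edges.append((process, "used", ("Artifact", src), {"role": "source"}))
    # output -> Artifact generated by the process, with role "output"
    for out in step.get("outputs", []):
        edges.append((("Artifact", out), "wasGeneratedBy", process, {"role": "output"}))
    # run-time parameters -> at most one Artifact used by the process
    params = step.get("run_time_parameters")
    if params is not None:
        edges.append((process, "used", ("Artifact", params), {"role": "parameter"}))
    # software -> algorithm: generic edge carrying the property "type"
    software = step.get("software_reference")
    if software is not None:
        for algo in step.get("algorithms", []):
            edges.append(
                (("Artifact", software), "edge", ("Artifact", algo), {"type": "algorithm"})
            )
    return edges


# Hypothetical process step with one input, one output, and one processor.
example = {
    "id": "step1",
    "processors": ["analyst"],
    "sources": ["raw_grid"],
    "outputs": ["clean_grid"],
    "run_time_parameters": "params1",
    "software_reference": "toolbox",
    "algorithms": ["filter_a"],
}
edges = map_process_step(example)
```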

It should be highlighted that ISO GMD defines many free-text properties (i.e., properties whose values are of character string type) that may hold provenance information. They can be compulsory, e.g., the description of a process step (LE_ProcessStep.description), or optional, e.g., the documentation of the processing information (LE_Processing.documentation). Although it is possible to insert structured information as free text into these properties, as in [30] and [32] where the namespace-code structure is used, many such properties are generally filled with unstructured information that cannot be mapped automatically. However, they can be consulted for information describing causal relations or state changes that fit into one of the mapping patterns above.

B. Refinement

Refinement of the base mapping pattern is warranted for geospatial applications if additional information on data provenance is specified. A processing step can create, modify, or simply duplicate the source artifact depending on the roles played by its processors (Table III). Based on the nature of the outcome, a processing step can be considered as following a creator or an affector pattern, where the former contributes to an artifact that did not exist prior to the process execution, whereas the latter changes the state of an existing artifact [18]. For a data source (LI_Source or LE_Source), the originator role clearly connects to the creator pattern, whereas the processor role connects to the affector pattern.

Fig. 2. Classes or attributes in ISO GMD that provide identifiers for mutable entities. TM_Primitive is an abstract data type defined in ISO19108 [31] for representing various types of time (e.g., time period). Attributes in CI_Citation, MD_Identification, and MD_DataIdentification irrelevant for identifying mutable entities are omitted. Numbers of omitted attributes are specified.

Fig. 3. Base mapping pattern from LE_ProcessStep to OPM. Subscripts W–Z and i–k are indexes for identity and version, respectively. Elements labeled [ISO GMD] are ISO GMD entities; elements without labels are OPM entities. The symbol R is causal relation and ← is mapping from ISO GMD entity to OPM entity.

Fig. 4. Create mapping pattern. Elements labeled [ISO GMD] are ISO GMD entities; elements without labels are OPM entities. The symbol R is role of OPM causal relation, ← is mapping from ISO GMD entity to OPM entity, and f is functional mapping from ISO GMD CI_Citation.identifier to OPM property uri.

The connections to the two patterns for the remaining seven roles are case dependent. Playing the ISO GMD roles of owner and user does not involve creating a new artifact but may change some aspects of an artifact, whereas playing the roles of author and principalInvestigator may involve creating and revising an artifact. Playing the roles of resourceProvider, distributor, and publisher involves making an artifact available for external consumption. The artifact’s identity and content are invariant through the resource providing, distributing, and publishing process, but the time at which the artifact is made available is later than its time of creation. Similarly, the progress stage of a data set can be completed, archived (coded historicalArchive), or obsolete. For these three codes the affector pattern applies. In addition, as in Section II-B, ISO GMD has one property recording repetitive updates and a second property accommodating versions of the same artifact. If recorded, these two properties also provide explicit indicators that the same data have had changes applied.

Below, the refinement for the stated cases and their mapping to OPM are elaborated. The refinement focuses mainly on the shaded area in Fig. 3. The mapping distinguishes the emerge pattern from the affect pattern to highlight the difference between state changes that involve an identity change and those that do not. As noted in Section II-B, the identifier associated with citation information, along with additional metadata entities such as temporal extent, can be used for representing identity.

1) Create: Raw data collected by sensors tend to have no input source but one or more output sources that are considered to be raw or the first occurrence of data. They are typically associated with the originator role. The mapping for such a role in a data lineage to OPM eliminates the Used(R ← source[ISO GMD]) relation and the associated Artifact because they do not exist prior to the execution of the process step. The date and time of the process step, if specified, can be mapped to the time-stamp associated with the causal relation wasGeneratedBy using the OPM property exactlyAt. The date property of the citation of an Artifact instance provides another source of information. If it refers to the date when the data are created (CI_DateTypeCode in Fig. 2), the time-stamp associated with wasGeneratedBy is replaced by the OPM time property noLaterThan (Fig. 4). The identity of the LE_Source can be obtained by mapping its identifier to the OPM property uri.
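The time-stamp choice in the Create pattern can be sketched as a small Python function. This is a minimal illustration, not the prototype's code: the function name, the argument names, and the rule that a process step time takes precedence over a citation creation date are all assumptions made for this sketch.

```python
from typing import Optional, Tuple


def generation_timestamp(
    step_time: Optional[str],
    citation_date: Optional[str],
    citation_date_type: Optional[str],
) -> Optional[Tuple[str, str]]:
    """Pick the OPM time property for a wasGeneratedBy edge (Create pattern).

    Assumption: a process step date/time, when present, wins over the
    citation date. The names here are hypothetical helpers, not OPM API.
    """
    if step_time is not None:
        # Date and time of the process step maps to exactlyAt.
        return ("exactlyAt", step_time)
    if citation_date is not None and citation_date_type == "creation":
        # A citation date typed as creation only bounds the generation time.
        return ("noLaterThan", citation_date)
    # No usable temporal information: leave the edge without a time-stamp.
    return None
```

For instance, a step recorded without its own time but whose output citation carries a creation date of `2010-06-01` would yield `("noLaterThan", "2010-06-01")`.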

2) Emerge: A process step for geospatial data can give rise to an Artifact instance with an identity that differs from the identities of any input Artifact instances. Such a process is a generalized create process as it involves at least one input Artifact instance. Running terrain analysis to extract drainage basins from a digital elevation model is one such example. Several roles in ISO GMD, specifically the processor and to a lesser degree the author and the principalInvestigator, can initiate a process of this kind.

The mapping for roles that result in the generation of new artifacts is shown in Fig. 5. It applies to processes that use certain input artifacts to generate the output artifacts. There can be multiple input and output artifacts, indicated by the subscripts X and Y associated with the left and right LE_Source in the figure. For each LE_Source_Y where the emerge pattern applies, it must be different from all LE_Source_X.

Similar to the Create pattern, the date and time associated with the process step may be specified or left empty. In the first case the date and time can be mapped to the time-stamps for both the Used and wasGeneratedBy causal relations. In the second case, the temporal extent or the date specified in the citation of each Artifact instance can be used as time-stamps of the two causal relations. The mapping should, however, distinguish between the time acquired from the source and output Artifact instances: the time acquired from the source Artifact instances has to be associated with the OPM time property noEarlierThan for the Used causal relation, and the time acquired from the output Artifact instances with the OPM time property noLaterThan for the wasGeneratedBy causal relation.
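The two time-stamp cases of the Emerge pattern can be illustrated with a short Python sketch. The edge tuple layout, the function names, and the use of exactlyAt when the process step carries its own time (by analogy with the Create pattern) are assumptions of this sketch rather than details confirmed by the prototype.

```python
from typing import Dict, List, Optional, Tuple

# (relation, effect, cause, time property, time value); layout is hypothetical.
Edge = Tuple[str, str, str, Optional[str], Optional[str]]


def _stamp(step_time: Optional[str], fallback_property: str,
           fallback_date: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Prefer the process step time; otherwise bound the edge by the artifact date."""
    if step_time is not None:
        return ("exactlyAt", step_time)  # assumed, as in the Create pattern
    if fallback_date is not None:
        return (fallback_property, fallback_date)
    return (None, None)


def emerge_edges(step_id: str, step_time: Optional[str],
                 sources: Dict[str, Optional[str]],
                 outputs: Dict[str, Optional[str]]) -> List[Edge]:
    """Build Used and wasGeneratedBy edges for one process step."""
    edges: List[Edge] = []
    for artifact, date in sources.items():
        # Source dates bound the Used relation from below.
        edges.append(("used", step_id, artifact) + _stamp(step_time, "noEarlierThan", date))
    for artifact, date in outputs.items():
        # Output dates bound the wasGeneratedBy relation from above.
        edges.append(("wasGeneratedBy", artifact, step_id) + _stamp(step_time, "noLaterThan", date))
    return edges
```

With the terrain analysis example, `emerge_edges("extract_basins", None, {"dem": "2009-05-01"}, {"basins": "2010-02-01"})` yields a Used edge stamped noEarlierThan and a wasGeneratedBy edge stamped noLaterThan; the identifiers and dates are illustrative only.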

3) Affect: The pattern applies to processes that affect certain aspects of an artifact but keep its theme, and thus its identity, intact. Examples include reprojecting the data from one georeference system to another and generating a revised artifact (indicated by the ISO GMD date type code Revision). As discussed in the introduction of this section, capturing these cases involving state change with invariant identity requires introducing an additional output artifact that carries the same identity as the input artifact but is a newer edition of it. The Affect pattern is therefore similar to the emerge pattern, but each output Artifact instance must have an input Artifact instance of identical identity (X = Y) that is of a previous version (i < j). The date and time mapping follows that of the emerge pattern.
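The identity and version conditions that separate the Create, Emerge, and Affect patterns can be written down as a small Python check. The (identity, version) pair representation and the "unclassified" fallback are assumptions introduced for this sketch; the paper derives identity from the citation identifier and related metadata entities.

```python
from typing import Iterable, Tuple

Artifact = Tuple[str, int]  # (identity, version); a hypothetical simplification


def classify_output(output: Artifact, inputs: Iterable[Artifact]) -> str:
    """Return the mapping pattern that fits one output artifact of a process step."""
    inputs = list(inputs)
    if not inputs:
        # No input source at all: the artifact is a first occurrence.
        return "create"
    identity, version = output
    same_identity_versions = [v for (i, v) in inputs if i == identity]
    if not same_identity_versions:
        # Identity differs from every input (Y differs from all X).
        return "emerge"
    if any(v < version for v in same_identity_versions):
        # Same identity (X = Y) and an earlier input version (i < j).
        return "affect"
    # Same identity without an earlier version fits none of the three patterns.
    return "unclassified"
```

For example, a reprojected layer `("landcover", 2)` produced from `("landcover", 1)` is classified as affect, whereas `("basins", 1)` produced from `("dem", 1)` is classified as emerge.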

4) Derive: There remain relations between two data sources that have limited or no specification of the process involved or the agents that participated in the change of state of each pair of data sources. However, metadata entities may clearly indicate that one data source is derived from the second data source. These cases are mapped to the wasDerivedFrom relation between two Artifact instances. The properties edition and series of the citation information (CI_Citation in Fig. 2) of a data layer are two other candidates for such mapping.

In [18], a general interrelation mapping is also proposed. An interrelation relates generations of artifacts. The derive mapping is a special case of the interrelation mapping as it is restricted to the formal OPM relation wasDerivedFrom between two artifacts.
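The fallback from the Used and wasGeneratedBy pair to wasDerivedFrom can be sketched in a few lines of Python. The function name and the tuple layout (relation, effect, cause) are hypothetical, chosen only to make the choice between the two outcomes explicit.

```python
from typing import List, Optional, Tuple

Edge = Tuple[str, str, str]  # (relation, effect, cause); layout is hypothetical


def lineage_edges(source_id: str, output_id: str,
                  step_id: Optional[str] = None) -> List[Edge]:
    """Map one source/output pair to OPM causal relations."""
    if step_id is None:
        # No process step is recorded: relate the two artifacts directly.
        return [("wasDerivedFrom", output_id, source_id)]
    # A process step is recorded: route the dependency through the process.
    return [("used", step_id, source_id),
            ("wasGeneratedBy", output_id, step_id)]
```

The same pair of data sources therefore produces one edge or two, depending on whether the metadata records the process step.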

IV. PROTOTYPE

Here, the lineage of the coral bleach monitoring data set from the NOAA NODC is used to demonstrate the feasibility of converting the ISO GMD data lineage entities to the OPM entities proposed in Section III. The data set is developed as a means to identify areas at risk for coral bleaching, such as the hotspots of various kinds of stresses to coral reefs. To achieve the stated goal, the remote sensing data from NOAA (e.g., the sea surface temperature) are used as inputs for deriving index values that represent environmental stresses to coral reefs. For example, the index coral bleaching degree heating weeks, which represents prolonged periods of thermal stress to coral and has been shown to successfully generate bleaching warnings [33], is derived using 12-week hotspot data from the same data set and is made available on the NOAA website [34]. For a thorough overview of the data set, see http://coralreefwatch.noaa.gov/satellite/product_overview.html.

Fig. 5. Emerge or affect mapping patterns. Elements labeled [ISO GMD] are ISO GMD entities; elements without labels are OPM entities. Subscripts X, Y and i, j are indexes for identity and version. The symbol R is role of OPM causal relation, ← is mapping from ISO GMD entity to OPM entity, and f is functional mapping from ISO GMD CI_Citation.identifier to OPM property uri.

Fig. 6. Process step ps_229 for deriving night time sea surface temperature data and the related metadata entities.

The metadata of the coral bleach monitoring data set is encoded in ISO GMD using ISO19139 [35], an XML schema implementation of ISO19115. It contains a variety of processing steps and roles played by responsible parties (e.g., the originator and the publisher in ISO GMD). An excerpt of the process step ps_229 for deriving night time sea surface temperature data and the related metadata entities is shown in Fig. 6. The description for the algorithm (LE_Algorithm) provides rich information regarding how the output is derived. It represents a case where a new Artifact instance (referenced by #src_SST_Anomaly) is generated by combining two input Artifact instances (referenced by #src_SST50_NIGHT and #src_SST_CLIMATE) using the algorithm specified in LE_Algorithm. Two similar process steps for the same lineage group, ps_230 for deriving coral bleach hotspots and ps_231 for deriving twice-weekly coral bleaching degree heating weeks, and the full metadata can be found at http://data.nodc.noaa.gov/nodc/archive/metadata/CLASS/iso/xml/CORBL.xml.

Fig. 7. OPM graph generated from process steps in CORBL.xml.

A prototype based on the aforementioned mapping patterns is developed for extracting OPM information from the geographic information metadata file (in XML) using JAVA. Dom4j (http://dom4j.sourceforge.net), an open source library for processing XML documents developed by multiple developers over several years, is used to parse and traverse the metadata file. The library comes with full support for W3C document object model, Simple API for XML (SAX), and JAVA API for XML Processing (JAXP). In the prototype the Dom4j classes SAXReader and Document are used to read the input XML files, i.e., CORBL.xml, and the Document method selectNodes() is used to locate specific nodes in CORBL.xml, e.g., those associated with LE_ProcessStep or LE_Source. The mapping patterns discussed in Section III are then applied in the prototype to convert the ISO GMD metadata elements and relations to OPM causal relationship graphs. Three formats of OPM graphs, including JAVA representation, XML and RDF, are supported. Graphviz [36] is used to create structural OPM information as diagrams of abstract graphs.
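The extraction step can be illustrated with a short sketch. It uses Python's standard xml.etree.ElementTree module in place of the Dom4j SAXReader and selectNodes() calls of the prototype, and it runs on a simplified, hypothetical excerpt rather than the actual CORBL.xml. The use of an id attribute on the process step and of xlink:href references on gmd:source and gmi:output is an assumption about the encoding, made so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the ISO 19139 (gmd) and ISO 19115-2 (gmi) XML encodings.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gmi": "http://www.isotc211.org/2005/gmi",
}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily simplified excerpt modeled on process step ps_229.
SAMPLE = """<gmd:LI_Lineage
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gmi="http://www.isotc211.org/2005/gmi"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <gmd:processStep>
    <gmi:LE_ProcessStep id="ps_229">
      <gmd:source xlink:href="#src_SST50_NIGHT"/>
      <gmd:source xlink:href="#src_SST_CLIMATE"/>
      <gmi:output xlink:href="#src_SST_Anomaly"/>
    </gmi:LE_ProcessStep>
  </gmd:processStep>
</gmd:LI_Lineage>"""


def extract_edges(xml_text):
    """Turn each LE_ProcessStep into used / wasGeneratedBy edges (emerge pattern)."""
    root = ET.fromstring(xml_text)
    edges = []
    for step in root.iter("{%s}LE_ProcessStep" % NS["gmi"]):
        process_id = step.get("id")
        for source in step.findall("gmd:source", NS):
            # The process used each referenced input artifact.
            edges.append(("used", process_id, source.get(XLINK_HREF).lstrip("#")))
        for output in step.findall("gmi:output", NS):
            # Each referenced output artifact was generated by the process.
            edges.append(("wasGeneratedBy", output.get(XLINK_HREF).lstrip("#"), process_id))
    return edges


EDGES = extract_edges(SAMPLE)
```

Each resulting tuple reads as (relation, effect, cause), so the list can be serialized to RDF or to a Graphviz description in a later step.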

Fig. 7 shows the OPM graph of the process steps ps_229, ps_230, and ps_231 generated by using the emerge pattern. Six Artifact instances are created in this data lineage. These Artifact instances correspond to either gmd:source or gmi:output in the metadata file. Other entities and annotations are also extracted from the metadata file and mapped to OPM properties, such as the mapping of the CI_Citation.title in gmd:LE_Source to the OPM property label. Based on the extracted provenance information, queries in SPARQL can be developed to identify which processes and input Artifact instances were involved in generating the output Artifact instances and when the output Artifact instances were generated.
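The kind of question such a query answers can be shown with a plain Python lookup over a list of causal edges. This is a stand-in for a SPARQL query, not SPARQL itself; the edge layout, the function name, and the generation time in the sample data are hypothetical.

```python
# (relation, effect, cause, time); the time value below is a made-up placeholder.
EDGES = [
    ("used", "ps_229", "src_SST50_NIGHT", None),
    ("used", "ps_229", "src_SST_CLIMATE", None),
    ("wasGeneratedBy", "src_SST_Anomaly", "ps_229", "2013-01-01T00:00:00Z"),
]


def explain(artifact, edges):
    """List the processes that generated an artifact, their inputs, and the time."""
    answers = []
    for relation, effect, cause, time in edges:
        if relation == "wasGeneratedBy" and effect == artifact:
            process = cause
            # Inputs are the artifacts that the generating process used.
            inputs = sorted(c for (r, e, c, _) in edges if r == "used" and e == process)
            answers.append({"process": process, "inputs": inputs, "generated": time})
    return answers
```

Calling `explain("src_SST_Anomaly", EDGES)` returns one record naming ps_229 together with its two input artifacts.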

V. CONCLUSION

This paper mapped the data lineage entities in ISO19115 and ISO19115-2 (ISO GMD), the two ISO standards for geographic information metadata and for imagery and gridded data, to entities in OPM. It aimed to facilitate data integration through explicit specification of the causal relations in ISO GMD without requiring data users to learn both ISO GMD and OPM semantics. The mapping allowed data users to combine structured geospatial data typically stored in SDI with the less structured data in the open data world, and to assess the fitness for use of these data accordingly. A base mapping along with four types of refinement, namely Create, Emerge, Affect, and Derive, was developed. The refinement was based on how identity and state of an Artifact instance may change in typical GIS and remote sensing applications and how these changes were specified in ISO GMD. A prototype system based on the Dom4j library in JAVA was developed to map the ISO GMD entities to the corresponding OPM entities.

The flexible structure of the ISO GMD, specifically the existence of optional metadata entities and the possibility to specify related provenance information with multiple ISO GMD entities, can result in alternative mapping outcomes. The change of OPM time labels from exactlyAt to noLaterThan is one such example. The change of the OPM causal relation from the Used-WasGeneratedBy pair to wasDerivedFrom as a result of the absence of the process step information is a second example of this kind. Several pieces of data provenance information in ISO GMD cannot be mapped, mainly because they are specified in unstructured texts (e.g., descriptions in character strings) and because of OPM's limitation on capturing future activities. The problem associated with unstructured texts was partially alleviated with custom-defined strings recognizable to a certain information community. Full utilization of these pieces of information, however, required developing parsers or converters for individual information communities.

The result of mapping the data lineage entities in ISO GMD to entities in OPM naturally provided a means to identify the Artifacts, Agents, and Processes that participated in changing states of certain artifacts, the time periods within which the state changes occurred, and their roles. Existing literature [5] identified other types of provenance concepts that were crucial to the evaluation of the information quality. In addition to traditional concepts that were explicitly recognized in the ISO GMD, such as certification authority and license, a newer concept, i.e., social descriptors that permitted individuals to directly or indirectly support or oppose certain artifacts, was missing. Although extending the ISO GMD was possible, mapping these concepts to OPM required a formal means to encode provenance semantics so that interoperability of the provenance information was possible [37].

In the future, geospatial ontologies that describe precisely the data semantics of geospatial data should be developed and incorporated as a part of data provenance. This is not an easy task as geoprocessing workflows involve many kinds of algorithms handling spatial relations between various data inputs, and these data may be generated by communities subscribing to different views on the same physical reality. Carrying out such a task, however, is critical for combining structured data in the SDI with less structured data from the open data world as it enables users to query data provenance using precise semantics grounded in well-defined ontologies. Means to automatically discover the causal relationships described in free-text fields should also be explored. These tools can improve the completeness of the provenance of a data source as it is not uncommon to find provenance information provided in free-text forms, such as the description attribute of LI_Source and LI_ProcessStep.

ACKNOWLEDGMENT

The author would like to thank J. Wu for his assistance in developing the prototype, and the three anonymous reviewers and the editors for their valuable comments.


REFERENCES

[1] J. Zhao, C. Goble, M. Greenwood, C. Wroe, and R. Stevens, “Annotating, linking and browsing provenance logs for e-Science,” in Proc. Workshop Semantic Web Technol. Search. Retr. Sci. Data, 2003, pp. 158–176.

[2] H. Moellering, H. J. Aalders, and A. Crane, World Spatial Metadata Standard: Scientific and Technical Characteristics, and Full Descriptions with Crosstable. Amsterdam, The Netherlands: Elsevier, 2005.

[3] Geographic Information—Metadata, ISO Standard 19 115:2003(E), 2003.

[4] Geographic Information Metadata Part 2: Extensions for Imagery and Gridded Data, ISO Standard 19 115-2:2009(E), 2009.

[5] A. Freitas, T. Knap, S. O’Riain, and E. Curry, “W3P: Building an OPM based provenance model for the Web,” Future Generat. Comput. Syst., vol. 27, no. 6, pp. 766–774, Jun. 2011.

[6] Y. L. Simmhan, B. Plale, and D. Gannon, “A survey of data provenance in e-science,” ACM SIGMOD Rec., vol. 34, no. 3, pp. 31–36, Sep. 2005.

[7] C. Tilmes, Y. Yesha, and M. Halem, “Tracking provenance of earth science data,” Earth Sci. Inf., vol. 3, nos. 1–2, pp. 59–65, 2010.

[8] P. Yue, J. Gong, and L. Di, “Augmenting geospatial data provenance through metadata tracking in geospatial service chaining,” Comput. Geosci., vol. 36, no. 3, pp. 270–281, Mar. 2010.

[9] L. Di, “A framework for developing web-service-based intelligent geospatial knowledge systems,” J. Geograph. Inf. Sci., vol. 11, no. 1, pp. 24–28, Jan. 2005.

[10] P. Fox, D. L. McGinnes, L. Cinquini, P. West, J. Garcia, J. L. Benedict, and D. Middleton, “Ontology-supported scientific data frameworks: The virtual solar-terrestrial observatory experience,” Comput. Geosci., vol. 35, no. 4, pp. 724–738, Apr. 2009.

[11] D. A. Bennett, W. Tang, and S. Wang, “Toward an understanding of provenance in complex land use dynamics,” J. Land Use Sci., vol. 6, nos. 2–3, pp. 211–230, 2011.

[12] C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee. (2008, Apr.). Proceedings of the Linked Data on the Web Workshop [Online]. Available: http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-369/

[13] T. Berners-Lee, J. Hendler, and O. Lassila, “The semantic web,” Sci. Amer., vol. 286, no. 5, pp. 34–43, May 2001.

[14] SPARQL Query Language for RDF. (2008) [Online]. Available: http://www.w3.org/TR/rdf-sparql-query/

[15] J. Masó, X. Pons, B. Schäffer, T. Foerster, and R. Lucchi. (2011, Mar.). Haiti Earthquake: Harmonizing Post-Event Distributed Data Processing [Online]. Available: http://www.earthzine.org/2011/03/18/haiti-earthquake-harmonizing-post-event-distributed-data-processing/

[16] D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M. R. Pocock, P. Li, and T. Oinn, “Taverna: A tool for building and running workflows of services,” Nucleic Acids Res., vol. 34, no. 2, pp. 729–732, Jul. 2006.

[17] D. Barseghian, I. Altintas, M. B. Jones, D. Crawl, N. Potter, J. Gallagher, P. Cornillon, M. Schildhauer, E. T. Borer, E. W. Seabloom, and P. R. Hosseini, “Workflows and extensions to the Kepler scientific workflow system to support environmental sensor data access and analysis,” Ecol. Inf., vol. 5, no. 1, pp. 42–50, Jan. 2010.

[18] S. Miles, “Mapping attribution metadata to the open provenance model,” Future Generat. Comput. Syst., vol. 27, no. 6, pp. 806–811, Jun. 2011.

[19] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, S. Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E. Stephan, and J. V. den Bussche, “The open provenance model core specification (v1.1),” Future Generat. Comput. Syst., vol. 27, no. 6, pp. 743–756, Jun. 2011.

[20] Y. Simmhan, P. Groth, and L. Moreau, “Special section: The third provenance challenge on using the open provenance model for interoperability,” Future Generat. Comput. Syst., vol. 27, no. 6, pp. 737–742, Jun. 2011.

[21] E. G. Stephan, T. D. Halter, and B. D. Ermold, “Leveraging the open provenance model as a multi-tier model for global climate research,” in Proc. 3rd Int. Provenance Annotation Workshop, vol. 6378. 2010, pp. 34–41.

[22] B. Plale, B. Cao, C. Herath, and Y. Sun, “Data provenance for preservation of digital geoscience data,” Geol. Soc. Amer. Special Papers, vol. 482, pp. 125–137, Feb. 2011.

[23] Y. Liu, J. Futrelle, J. Myers, A. Rodriguez, and R. Kooper, “A provenance-aware virtual sensor system using the open provenance model,” in Proc. Int. Symp. Collaborative Technol. Syst., May 2010, pp. 330–339.

[24] N. Kwasnikowska and J. V. den Bussche, “Mapping the NRC dataflow model to the open provenance model,” in Proc. 2nd Int. Provenance Annotation Workshop, vol. 5272. Jun. 2008, pp. 3–16.

[25] X. Yang, J. D. Blower, L. Bastin, V. Lush, A. Zabala, J. Masó, D. Cornford, P. Díaz, and J. Lumsden, “An integrated view of data quality in earth observation,” Phil. Trans. R. Soc. A, vol. 371, no. 1983, p. 20120072, Dec. 2012.

[26] G. Booch, J. Rumbaugh, and I. Jacobson, The Unified Modeling Language User Guide, 2nd ed. Reading, MA, USA: Addison-Wesley, 2005.

[27] W.-C. Tan, “Provenance in databases: Past, current, and future,” Bull. Tech. Committee Data Eng., vol. 32, no. 4, pp. 3–12, Apr. 2007.

[28] B. R. Barkstorm, “A mathematical framework for earth science data provenance tracing,” Earth Sci. Inf., vol. 3, no. 3, pp. 167–196, Sep. 2010.

[29] European Commission Joint Research Centre. (2010, Jun.). INSPIRE Metadata Implementing Rules: Technical Guidelines Based on EN ISO 19115 and EN ISO 19119, Brussels, Belgium [Online]. Available: http://inspire.jrc.ec.europa.eu/documents/Metadata/INSPIRE_MD_IR_and_IS%O_v1_2_20100616.pdf

[30] Use of ISO Metadata Specifications to Describe Geoscience Information Resources. (2010, Nov.) [Online]. Available: http://repository.usgin.org/sites/default/files/dlio/files/2011/u11/usg%in_iso_metadata_1.1.3.pdf

[31] Geographic Information—Temporal Schema, ISO Standard ISO/FDIS 19 108:2002(E), 2002.

[32] National Oceanic and Atmospheric Administration. (2012, Jan.) [Online]. Available: http://geo-ide.noaa.gov/wiki/index.php?title=ISO_Lineage

[33] G. Liu, A. E. Strong, and W. Skirving, “Remote sensing of sea surface temperature during 2002 barrier reef coral bleaching,” EOS, Trans. Amer. Geophys. Union, vol. 84, no. 15, pp. 137–144, Apr. 2003.

[34] NOAA Coral Reef Watch Operational 50-km Satellite Coral Bleaching Degree Heating Weeks Product. (2013) [Online]. Available: http://coralreefwatch.noaa.gov/satellite/hdf/index.html

[35] Geographic Information—Metadata—XML Schema Implementation, ISO Standard 19 139:2007, 2007.

[36] J. Ellson, E. Gansner, L. Koutsofios, S. North, and G. Woodhull, “Graphviz—Open source graph drawing tools,” in Proc. 9th Int. Symp. Graph Draw., 2001, pp. 483–484.

[37] S. Ram and J. Liu, “A semantic foundation for provenance management,” J. Data Semant., vol. 1, no. 1, pp. 11–17, Jan. 2012.

Chen-Chieh Feng received the B.S. and M.S. degrees in geography from National Taiwan University, Taipei, Taiwan, and the Ph.D. degree in geography from the University at Buffalo, The State University of New York, Buffalo, NY, USA, in 1994, 1996, and 2004, respectively.

He is currently an Assistant Professor with the Department of Geography, National University of Singapore, Singapore. His current research interests include the use of geographic information systems and remote sensing for studying impacts of land use land cover changes on public health, remote sensing data accuracy, and spatial data mining and handling.