Preserving Digital Geospatial Data:
The NC Geospatial Data Archiving Project (NCGDAP)
Steven P. Morris
North Carolina State University Libraries
CRADLE Seminar November 17, 2006
Note: Percentages based on the actual number of respondents to each question 2
NC Geospatial Data Archiving Project
• Partnership between university library (NCSU) and state agency (NCCGIA)
• Focus on state and local geospatial data in North Carolina (state demonstration)
• Tied to NC OneMap initiative, which provides for seamless access to data, metadata, and inventories
• Objective: engage existing state/federal geospatial data infrastructures in preservation
• Project approaches: Technical and Social
Serve as catalyst for discussion within industry
Targeted data: Digital orthophotography
• 85+ NC counties with orthophotos
• 1-5 flights per county
• 30-200 GB per flight
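For a rough sense of scale, the figures above can be multiplied out to bound the total orthophoto volume. The sketch below is only back-of-the-envelope arithmetic: it assumes exactly 85 counties (the slide says "85+", so both bounds are floors) and decimal units, and the function name is illustrative.

```python
# Rough bounds on statewide orthophoto volume, using the slide's figures.
# Assumption: exactly 85 counties; the slide says "85+", so both bounds are floors.
COUNTIES = 85
FLIGHTS_PER_COUNTY = (1, 5)   # (min, max) flights per county
GB_PER_FLIGHT = (30, 200)     # (min, max) gigabytes per flight


def volume_bounds_gb(counties=COUNTIES, flights=FLIGHTS_PER_COUNTY, gb=GB_PER_FLIGHT):
    """Return (low, high) total volume in gigabytes."""
    low = counties * flights[0] * gb[0]
    high = counties * flights[1] * gb[1]
    return low, high


low_gb, high_gb = volume_bounds_gb()
print(f"{low_gb / 1000:.2f} TB to {high_gb / 1000:.0f} TB")
```

Under these assumptions the collection falls somewhere between about 2.5 TB and 85 TB, a spread wide enough that per-county inventory matters more than the statewide average.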
Targeted data: Vector data (w/tabular)
Economic, infrastructure, and ethnographic data
Today’s geospatial data as tomorrow’s cultural heritage
Future uses of data are difficult to anticipate (as with Sanborn Maps).
Risks to State/Local Geospatial Data
• Producer focus on current data
  • Data overwrite as common practice
• Future support of data formats in question
  • No open, supported format for vector data
• Shift to web services-based access
  • Data becoming more ephemeral
• Inadequate or nonexistent metadata
  • Impedes discovery and use
• Increasing use of spatial databases for data management
  • The whole is greater than the sum of the parts
Challenge: Vector Data Formats
No widely-supported, open vector formats for geospatial data
• Spatial Data Transfer Standard (SDTS) not widely supported
• Geography Markup Language (GML): diversity of application schemas and profiles threatens permanent access
• Spatial databases
  • The whole is more than the sum of the parts, and the whole is very difficult to preserve
  • Can export individual data layers for curation
  • Some thinking of using the spatial database as the primary archival platform
Challenge: Cartographic Representation
Counterpart to the map is not just the dataset but also models, symbolization, classification, annotation, etc.
Challenge: Geospatial Web Services
• How to capture records from decision-making processes?
• Possible: Atlas collections from automated image capture
• Web 2.0 impact: Emerging tiling and caching schemes (archive target?)
Different Ways to Approach Preservation
Technical solutions: How do we archive acquired content over the long term?
• Build a data repository: not as an end in itself but as a catalyst for discussion within the data community
• Develop a repository ingest workflow: create technical points of engagement with the digital preservation community
Different Ways to Approach Preservation
Cultural/Organizational solutions: How do we make the data more preservable, and more likely to be archived, from the point of production?
• Engage data producer community and spatial data infrastructure through outreach; influence practice
• Sell the problem to software vendors and standards developers
• Find overlap with more compelling business problems: disaster preparedness, business continuity, road building, etc.
• Start a discussion about roles at the local, state, and federal levels
NCGDAP Technical Approach
• Receive data as is, through a variety of distribution methods
• Migration of some at-risk formats
• Metadata remediation, normalization, and synchronization
• Distilling complex objects into repository ingest items (not easy)
• Using DSpace for demonstration purposes (keeping repository platform at arm's length)
• In development: use METS record as dormant item “brain” within the repository
Some unsustainable activities – for learning experience
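To make the idea of a METS record travelling with each ingest item more concrete, here is a minimal sketch that builds a skeletal METS document with Python's standard library. It is not the project's METS profile, which the slides do not describe; the object identifier and file names are hypothetical, and only the basic `fileSec` and `structMap` sections are shown.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(object_id, file_names):
    """Build a skeletal METS element listing the files of one ingest item."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "original"})
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", {"TYPE": "physical"})
    root_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"LABEL": object_id})
    for i, name in enumerate(file_names, start=1):
        file_id = f"FILE{i:03d}"
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {"ID": file_id})
        # FLocat points at the file; here the location is just a relative name.
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": name},
        )
        # The structMap references each file by its ID.
        ET.SubElement(root_div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return mets


# Hypothetical item: file names are placeholders, not project data.
record = build_mets("zipcodes-demo", ["zipcodes.shp", "zipcodes.dbf", "zipcodes.xml"])
print(ET.tostring(record, encoding="unicode"))
```

A record like this stays "dormant" in the sense that the repository platform does not need to interpret it; it simply carries the item's structure so that the item could later be moved to a different platform.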
Building Data Bundles: The Zip Codes Example
Where is the Dataset?
Here’s One!
Files
• Multi-file dataset
• Georeferencing
• Metadata file
• Symbolization file
• Additional documentation
• License
• Disclaimer
• More
Metadata
• FGDC
• Acquisition metadata
• Transfer metadata
• Ingest metadata
• Archive rights
• Archive processes
• Collection metadata
• Series metadata
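Because a single "dataset" is really a bundle of many files, one simple safeguard is a manifest that lists every file in the bundle with a checksum. The sketch below is an illustration under assumptions, not the project's ingest code: the directory layout, function names, and choice of SHA-256 are all hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large rasters need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir):
    """Map each file's path (relative to the bundle root) to its SHA-256 digest."""
    root = Path(bundle_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[path.relative_to(root).as_posix()] = sha256_of(path)
    return manifest
```

Calling `build_manifest` on a bundle directory at acquisition time and again at ingest gives a cheap check that no component file (data, metadata, license, symbolization) was lost or altered in between.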
Hub-and-Spoke Metadata Workflow
Issues:
• Ingest process needs access to repository specifics (e.g., what collections exist)
• Understanding of what the core elements should be is refined as spokes are added
• Need to consider repository response to SIP or AIP evolution
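One plausible reading of the hub-and-spoke idea is a repository-neutral core record (the hub) that is translated into each target format (a spoke) by a simple crosswalk. The sketch below illustrates that reading only; the workflow diagrams are not preserved in this text, and every field name and mapping here is hypothetical.

```python
# Hub: one repository-neutral core record per ingest item.
# Spokes: per-target crosswalks from core field names to target field names.
# All field names and mappings are hypothetical, for illustration only.
SPOKES = {
    "dublin_core": {
        "title": "dc.title",
        "producer": "dc.creator",
        "date_captured": "dc.date",
    },
    "collection_index": {
        "title": "label",
        "county": "place",
    },
}


def to_spoke(core_record, spoke_name):
    """Translate a core record for one spoke.

    Returns (translated, unmapped), where unmapped lists the core fields
    this spoke has no mapping for.
    """
    mapping = SPOKES[spoke_name]
    translated = {
        target: core_record[field]
        for field, target in mapping.items()
        if field in core_record
    }
    unmapped = sorted(set(core_record) - set(mapping))
    return translated, unmapped


core = {"title": "Zip Codes", "producer": "Example County GIS", "rights": "See license"}
print(to_spoke(core, "dublin_core"))
```

Reporting the unmapped fields each time a spoke is added is one way to surface the second issue above: the set of core elements is refined as new spokes reveal what they can and cannot carry.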
Metadata: Going Beyond a Passive Role
• Feedback to the NC OneMap Metadata Outreach Program on metadata quality problems encountered in repository ingest
• Engage standards body (Open Geospatial Consortium, OGC) in discussions about:
  • content packaging standards for geospatial data
  • better practices for time-versioned data
  • persistent identifier schemes
  • contributing archive use cases to GeoDRM
• Meetings with major software vendor development teams
Social Issues: Changing Industry Thinking
• Is the geospatial industry “temporally impaired”?
  • Lack of access to older data
  • Lack of tool/model support for temporal analysis
  • Metadata: poor support for changing data
  • Education: building class projects around available data (i.e., not temporal)
• Increased interest now in temporal applications?
  • Increased demand for temporal data?
  • Improved tool support: ArcGIS 9.2 animation tools; Geodatabase History, etc.
IMPORTANT: Gathering business cases for using older data
Social Issues: Content Exchange Networks
• Solving the present-day problems of data sharing is a prerequisite to solving the problem of long-term access
• Leveraging more compelling business problems: disaster preparedness and business continuity needs can put the data in motion (siphon off to the archive)
• Geospatial data: large data volumes, frequent data updates, complex datasets, ambiguous rights
• Content exchange network technical challenges:
  • Rights management
  • Large-scale transfers on network
  • Content packaging (MPEG-21 DIDL, XFDU, METS, …)
Content Issues: Frequency of Capture Survey
Survey objective:
• Document current practices for obtaining archival snapshots of county/municipal geospatial vector data layers
• Seek guidance about frequency of capture
Survey topics:
• General questions about data archiving practice
• Specific questions about parcels, street centerlines, jurisdictional boundaries, and zoning
Survey subjects:
• All 100 counties and 25 municipalities (58% response rate)
• Survey conducted September 2006
Added benefit: Survey socialized the preservation issue
NC County/Municipal Agency Frequency of Capture: Parcel Data
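[Pie chart. Legend: Annually, Every 6 Months, Quarterly, Monthly, Weekly or Daily, Not Saved. Slice labels as extracted: 42%, 9%, 9%, 14%, 13%, 13%. The pairing of percentages with legend entries is not recoverable from the extracted text.]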
Based on a percentage of the respondents that indicate they actually archive some data
Content Issues: What About Commercial Data?
• Cultivating a commercial market for older data
• Part of “permanent access” is marketing, advertising, and putting older data into the path of the user
New Challenges: “Platial” vs. Spatial Imagery
• Mobile, LBS, and social networking applications drive demand for place-based data
• Example sources:
  • Oblique imagery
  • Street-view imagery (e.g., A9.com)
  • Transportation Dept. videologs
• Long-term cultural heritage value in non-overhead imagery: more descriptive of place and function
• Emerging: “Tricorder” applications
New Challenges: Ajax Applications, Google Earth and All That
• Emerging online environments are increasingly used to make decisions; how are these decisions documented?
• Web mashup/AJAX interactions with existing systems spur creation of intermediate content layers: e.g., tiling and caching of WMS services
• Formulation of a standard tiling scheme may create a new preservation opportunity (temporal axis on caches?)
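To illustrate what a "temporal axis on caches" might look like, the sketch below computes a tile address and prefixes it with a capture date. It assumes the widely used XYZ Web Mercator tiling convention, which is an assumption for illustration and not a scheme named in these slides; the key layout and function names are likewise hypothetical.

```python
import math


def tile_xy(lon_deg, lat_deg, zoom):
    """Return the (x, y) tile indices containing a point, XYZ Web Mercator scheme."""
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat)) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern or southern edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def temporal_tile_key(lon_deg, lat_deg, zoom, capture_date):
    """Build a cache key with the capture date as an extra, leading axis."""
    x, y = tile_xy(lon_deg, lat_deg, zoom)
    return f"{capture_date}/{zoom}/{x}/{y}"


# Approximate coordinates for Raleigh, NC; the date is this seminar's date.
print(temporal_tile_key(-78.64, 35.78, 12, "2006-11-17"))
```

With the date as part of the key, a cache refresh writes new tiles alongside the old ones instead of overwriting them, so the cache itself becomes a candidate archival record.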
Working with New Partners
• State Archives now an informal member of the NCGDAP project
• Collaboration with NARA
• Working with the Open Geospatial Consortium on standards issues
• Associate Partnership with JISC-funded UK-wide project
• Site visits with ESRI (major software vendor) development groups
• Participation in a variety of content exchange network activities
• More …
Next Steps
• Working with NARA and the OGC Interoperability Institute to develop an OGC Data Preservation Working Group charter
• Evaluating results of the frequency of capture survey
• Stepping up data acquisition and repository ingest
• Evaluating initial data acquisition efforts (time factors, content variety, technical/legal barriers)
• Partnership with content exchange network activities
• Ramping up partnerships with broader (non-geospatial) data repository efforts
Questions?
Contact:
Steve Morris
Head, Digital Library Initiatives
NCSU Libraries
ph: (919) [email protected]
http://www.lib.ncsu.edu/ncgdap