The need for Interoperability in Office and GIS formats

GFOSS04 interoperability

Free GIS and Interoperability
GIS Open Source, interoperabilità e cultura del dato
nei SIAT della Pubblica Amministrazione

[GIS Open Source, interoperability and the 'culture of data'
in the spatial data warehouses of the Public Administration]
GFOSS'04
ITC-irst, 16 Nov 2004
(last revised 10 2005)

M. Neteler
neteler at itc it
http://mpa.itc.it

ITC-irst, Povo (Trento), Italy

The need for Interoperability

The problem

nowadays data have to be exchanged across often very heterogeneous groups

the personal choice of application software/operating system should not affect
the data exchange

data exchange standards are available

limited awareness of the need for interoperability

limited implementation of interoperability in processes and software

commonly used file formats lead users to believe in interoperability: false friends

What are Standardization & Interoperability?

Standardization versus Interoperability

Standardization: Written/published document describing data formats, models etc.

Example Office Standards: ASCII, HTML, XML, ...
Example GIS Standards: GML, ISO/IEC 8211, ISO/IEC 15444-1, WMS etc.

Only published standards are acceptable.


Interoperability: More than application of standardization, it also comprises the
interpretation of the standard (sometimes definitions are incomplete)

Desired: Lossless transfer of static or dynamic data between
- different users, systems, applications, and
- different operating systems, platforms.

Interoperability?

The two dimensions of Interoperability

Longitudinal Interoperability: time - long term storage

Data shall be readable over time (years, decades, ...).
This is of particular interest for data of public administration
and long-term projects.

Transversal Interoperability: sharing data between users

Data shall be readable across user communities, independent
from software or operating system used (freedom of software choice).
Again, this is of particular interest for data of public administration
and long-term projects.

Part I: Office Interoperability

Example: MS-Word .DOC format

Are WORD.doc files suitable for data exchange?

the format is undocumented, to some extent it was reverse-engineered
does not support transversal interoperability

the format is regularly changed (Word 1, 2, 95, 97, NT, 2000, XP, ...
also named WinWORD 6, 8, 10,...)
does not support longitudinal interoperability

Prone to MS-Windows macro viruses

severe security/privacy issues (example next slide)
- DOC files contain sensitive information about the user (unrelated
to the contents)
- deleted text may still be legible outside of MS-Word

contents cannot be completely verified

Example: MS-Word .DOC format - security/privacy issues

Descrambling a WORD.doc file

Your unique MS-Windows user ID (or similar):
PID_GUIDAN{714738E3-FF4C-11D3-ZD7C-00E0281D67A7}
This makes your (anonymous) document traceable.

Sometimes deleted text is still visible (think of re-using an existing WORD file)

A famous example:
In February 2003, the British government of Tony Blair published a dossier on
Iraq's security and intelligence organizations. This dossier was cited by
Colin Powell in his address to the United Nations the same month.
Dr. Glen Rangwala, a lecturer in politics at Cambridge University, quickly
discovered that much of the material in the dossier was actually plagiarized
from a U.S. researcher on Iraq.
http://www.computerbytesman.com/privacy/blair.htm

# on any UNIX/Linux system, simply run:
tr -d '[:cntrl:]' < wordfile.doc

What you may find:

Descrambling a WORD.doc file: The British Iraq dossier 2003 1/2

http://nytimes.com

[neteler@dandre2 gfoss04]$ tr -d [:cntrl:] < blair.doc
>z|y [...]
-xxxx-o#o#{'?^,k6-* RuG (-$IRAQ ITS INFRASTRUCTURE OF CONCEALMENT,DECEPTION AND INTIMIDATIONThis report draws upon a number of sources, including intelligence material, and shows how the Iraqi regime is constructed to have, and to keep, WMD, and is now engaged in a campaign of obstruction of the United Nations Weapons Inspectors.[...]
[`azbhhhh?h-i/isjcic22JC:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asdcic22JC:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asdcic22JC:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asdJPrattC:\TEMP\Iraq - security.docJPrattA:\Iraq - security.docablackshaw!C:\ABlackshaw\Iraq - security.docablackshaw#C:\ABlackshaw\A;Iraq -security.docablackshawA:\Iraq - security.docMKhanC:\TEMP\Iraq - security.docMKhan(C:\WINNT\Profiles\mkhan\Desktop\Iraq.docPjzXV*uzLl_bzLl_[...]
jP@GTimes New Roman5SymbolG&ArialHelveticaA&Arial Narrow?&ArialBlack"qh_r&r&aq#JV,?RVW,!??20di?fCIraq- ITS INFRASTRUCTURE OFCONCEALMENT, DECEPTION AND INTIMIDATIONdefaultMKhanOh+'0?4DPlx??DIraq- ITS INFRASTRUCTURE OF CONCEALMENT, DECEPTION ANDINTIMIDATIONraqdefaultefaefaNormal.dotNMKhan.d4haMicrosoft Word 8.0C@Ik@n)@"Zf@du#JV[...]

http://www.computerbytesman.com/privacy/blair.htm

- "cic22" stands for "Communications Information Centre," a unit of the British Government
- Paul Hamill - Foreign Office official
- John Pratt - Downing Street official
- Alison Blackshaw - The personal assistant of the Prime Minister's press secretary
- Murtaza Khan - Junior press officer for the Prime Minister
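
The inspection shown above can also be scripted. The following is a minimal Python sketch, not the author's method: it mimics `tr -d '[:cntrl:]'` and additionally lists runs of printable ASCII, similar in spirit to the UNIX `strings` tool. The function names, the minimum run length and the sample bytes are assumptions made for illustration; text stored as UTF-16 inside a .doc file would not be found by this ASCII-only scan.

```python
import re

def strip_control_bytes(data):
    """Rough analogue of tr -d '[:cntrl:]': drop ASCII control bytes (0-31, 127)."""
    return bytes(b for b in data if b >= 32 and b != 127)

def printable_runs(data, min_len=6):
    """Return runs of at least min_len printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Synthetic bytes standing in for a binary document (not a real .doc file).
sample = b"\x00\x01Iraq - security.doc\x00\x02ab\x03C:\\TEMP\\Iraq - security.doc\x00"

for run in printable_runs(sample):
    print(run)
```

Listing separate runs keeps file paths and user names apart, whereas deleting control bytes alone fuses them into one long line, as in the output above.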

WMD: weapons of mass destruction

Descrambling a WORD.doc file: The British Iraq dossier 2003 2/2

Example: MS-Excel .XLS format

Are EXCEL.xls files suitable for data exchange?

the format is undocumented, to some extent it was reverse-engineered
does not support transversal interoperability

the format is regularly changed (Excel 95, 97, NT, 2000, ...)
does not support longitudinal interoperability

Prone to MS-Windows viruses

Limitation: max. 65536 rows in a table (2^16)

Auto-conversion feature risky: Some fields/columns are automatically changed to
date-time format (see example next slides)
high risk of accidental data damage

Example: MS-Excel .XLS format - accidental data damage

The Human Genome Project case 1/3

In 2004 scientists discovered that some gene names were being changed
inadvertently to non-gene names. Citation:

A little detective work traced the problem to default date format conversions and
floating-point format conversions in the very useful Excel program package.
The date conversions affect at least 30 gene names; the floating-point conversions
affect at least 2,000 if Riken identifiers are included. These conversions are
irreversible; the original gene names cannot be recovered.
A default date conversion feature in Excel (Microsoft Corp., Redmond, WA) was
altering gene names that it considered to look like dates. For example, the tumor
suppressor DEC1 [Deleted in Esophageal Cancer 1] [3] was being converted
to '1-DEC.'

Cited after:
B.R. Zeeberg, J. Riss, D.W. Kane, K.J. Bussey, E. Uchio, W.M. Linehan,
J.C. Barrett and J.N. Weinstein, BMC Bioinformatics 2004, 5:80
http://dx.doi.org/10.1186/1471-2105-5-80
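
To make the risk concrete, identifiers can be screened before they ever pass through a spreadsheet. The following Python sketch is a heuristic under stated assumptions, not Excel's actual conversion rules: the month list, the two regular expressions and the function name `risky_identifiers` are hypothetical. `DEC1` is the example from the citation above; `TP53` stands for a name without such a pattern, and `2310009E13` is an illustrative identifier of the digits-E-digits form that a spreadsheet could read as a floating-point number.

```python
import re

# Assumption: a simplified list of month spellings; real spreadsheet rules are broader.
MONTHS = ["JAN", "FEB", "MAR", "MARCH", "APR", "APRIL", "MAY", "JUN", "JUNE",
          "JUL", "JULY", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC"]
DATE_LIKE = re.compile(r"^(?:%s)-?\d{1,2}$" % "|".join(MONTHS), re.IGNORECASE)
FLOAT_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)

def risky_identifiers(names):
    """Map each identifier that a spreadsheet might reinterpret to the reason."""
    risky = {}
    for name in names:
        if DATE_LIKE.match(name):
            risky[name] = "date-like"
        elif FLOAT_LIKE.match(name):
            risky[name] = "float-like"
    return risky

print(risky_identifiers(["DEC1", "TP53", "2310009E13"]))
```

Flagged names can then be kept in a plain-text format, or imported with the column explicitly typed as text.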

The Human Genome Project case 2/3

http://dx.doi.org/10.1186/1471-2105-5-80

The Human Genome Project case 3/3

http://dx.doi.org/10.1186/1471-2105-5-80

Suggestions for Office data interoperability

Text files: ASCII, HTML, RTF, XML, LaTeX
Postscript/PDF for read-only documents

Tables: CSV, xBase (dBase), XML

Databases: SQL92-ASCII

Bibliography: BibTeX

Use documented ASCII formats instead of undocumented binary formats
(disk space is not an issue today)

Files can be compressed later (deflate compression, which is supported
by all common compression tools and Web browsers)
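
As an illustration of this suggestion, a table can be written as plain CSV and compressed afterwards. This is a minimal Python sketch using only the standard library; the function names and the sample table are hypothetical. gzip is used here because it wraps deflate compression.

```python
import csv
import gzip
import io

def table_to_gzipped_csv(header, rows):
    """Serialise a table as CSV text and compress it with gzip (deflate)."""
    text = io.StringIO()
    writer = csv.writer(text, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("ascii"))

def gzipped_csv_to_table(blob):
    """Decompress and parse the CSV again; all values come back as strings."""
    text = gzip.decompress(blob).decode("ascii")
    return list(csv.reader(io.StringIO(text)))

header = ["gene", "count"]
rows = [["DEC1", "12"], ["TP53", "7"]]
blob = table_to_gzipped_csv(header, rows)

# The round trip is lossless: the identifier DEC1 is still the text "DEC1".
assert gzipped_csv_to_table(blob) == [header] + rows
```

The uncompressed file stays readable with any text editor, so the compression step does not reduce interoperability.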

Suggestions for Office data interoperability

Automated conversion tools can be used to provide all formats

Text files: ASCII, HTML, RTF, XML, Postscript/PDF
Converters (examples): OpenOffice.org [1], wvWare [2]

Tables: CSV, xBase (dBase), XML
Converters (examples): OpenOffice.org, xbase2pg [3]

Databases: SQL92-ASCII
Converters (examples): ODBC, xbase2pg

Bibliography: BibTeX
Converters (examples): Bibutils [4], Bibtex2html [5], (Endnote)

[1] http://OpenOffice.org (OpenOffice.org itself uses XML as its own standard format)
[2] http://wvware.sourceforge.net/
[3] http://www.klaban.torun.pl/prog/pg2xbase/
[4] http://www.scripps.edu/~cdputnam/software/bibutils/bibutils.html
[5] http://www.lri.fr/~filliatr/bibtex2html/
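
Because the suggested formats are documented, small conversions can also be scripted directly. The sketch below converts CSV text into a simple XML document using only the Python standard library. It is an illustration, not one of the converters listed above: the element names `table` and `row` and the function name are hypothetical, and it assumes that the CSV column names are valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text, root_tag="table", row_tag="row"):
    """Convert CSV text with a header line into a simple XML document string."""
    root = ET.Element(root_tag)
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = ET.SubElement(root, row_tag)
        for column, value in record.items():
            # Assumption: each column name is usable as an XML element name.
            ET.SubElement(row, column).text = value
    return ET.tostring(root, encoding="unicode")

print(csv_to_xml("gene,count\nDEC1,12\n"))
```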

OASIS: Office data interoperability

Promotion of Open Document Exchange Format

Proposed and implemented new open standard format:
OASIS OpenDocument XML format

The OASIS OpenDocument format [1] is a vendor- and implementation-independent
file format which guarantees freedom and independence

E.g., OpenOffice.org uses OASIS OpenDocument as its default format from version 2.0
onwards, as do KOffice, StarOffice and software from other vendors

The OASIS OpenDocument file format is one of the file formats
recommended by the European Commission [2]

[1] http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
[2] http://europa.eu.int/idabc/en/document/3439
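
One practical consequence of an open, documented format is that generic tools can read it. An OpenDocument file is a ZIP archive whose main text is stored in an XML member named content.xml, so the paragraphs can be extracted with a ZIP and an XML library alone. The Python sketch below assumes this package layout and the OpenDocument text namespace; the function names are hypothetical, and `demo_package` builds a deliberately minimal stand-in archive for demonstration, not a complete, valid document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

def paragraphs_from_odf(package):
    """Return the text of all text:p paragraphs found in content.xml."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter("{%s}p" % TEXT_NS)]

def demo_package():
    """Build a minimal stand-in archive (assumption: not a full ODF document)."""
    content = (
        '<office:document-content xmlns:office="%s" xmlns:text="%s">'
        "<office:body><office:text>"
        "<text:p>Interoperable</text:p><text:p>by design</text:p>"
        "</office:text></office:body></office:document-content>"
    ) % (OFFICE_NS, TEXT_NS)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", content)
    return buffer.getvalue()

print(paragraphs_from_odf(demo_package()))
```

No comparable few-line reader is possible for an undocumented binary format without reverse engineering.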

Part II: GIS Interoperability

GIS Standards and Organizations

GIS data sets are more than geometry: Metadata
- geographic reference
- colors, display attributes etc
- history of data modifications

[Timeline figure with the years 1990, 1992, 1994, 1997 and 2004: GRASS Interagency Steering Committee, Open GRASS Foundation (OGF), Open GIS Consortium (OGC), ISO/TC 211, Open Geospatial Consortium (OGC); standards shown: GML, WMS etc.]
http://www.opengeospatial.org

GIS Interoperability: GDAL and OGR libraries

De-facto standard for GIS formats

Data abstraction GDAL
http://www.gdal.org

GDAL (raster) and OGR (vector)

Abstraction layer over 40 raster formats, e.g. ENVI, GeoTIFF, SAR, GRASS, ECW, HDF4, JPEG2000, MrSID, ArcGRID

Metadata
- Number of bands
- Color table
- ...
- Coordinate system
- Projection

EPSG codes, PROJ.4

GIS Interoperability: GDAL and OGR libraries

Data abstraction OGR
http://www.gdal.org/ogr/

GDAL (raster) and OGR (vector)

Abstraction layer over 20 vector formats, e.g. ArcCover, MITAB, Oracle, SHAPE, PostGIS, Geodatabase, DGN

Metadata
- Coordinate system
- Projection

EPSG codes, PROJ.4
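
The idea of an abstraction layer can be sketched in a few lines: each format has its own reader ("driver"), and the application calls one function that picks the right driver and returns the same data model. The toy Python example below is not the GDAL/OGR API. All names are hypothetical, and the two readers handle only simplified forms of an ASCII grid (just `ncols` and `nrows` header lines) and a plain PGM image (no comment lines).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Raster:
    driver: str
    cols: int
    rows: int
    cells: List[List[float]]

def _to_rows(values, cols, rows):
    """Reshape a flat value list into rows, checking it against the header."""
    if len(values) != cols * rows:
        raise ValueError("cell count does not match header")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

def read_ascii_grid(text):
    """Simplified ASCII grid: 'ncols N', 'nrows M', then the cell values."""
    tokens = text.split()
    header = {tokens[0].lower(): int(tokens[1]), tokens[2].lower(): int(tokens[3])}
    values = [float(t) for t in tokens[4:]]
    cols, rows = header["ncols"], header["nrows"]
    return Raster("ascii-grid", cols, rows, _to_rows(values, cols, rows))

def read_plain_pgm(text):
    """Simplified plain PGM: 'P2', width, height, maxval, then the cell values."""
    tokens = text.split()
    cols, rows = int(tokens[1]), int(tokens[2])
    values = [float(t) for t in tokens[4:]]  # tokens[3] is the maximum value
    return Raster("plain-pgm", cols, rows, _to_rows(values, cols, rows))

# Driver registry: (format test, reader) pairs tried in order.
DRIVERS = [
    (lambda t: t.lstrip().lower().startswith("ncols"), read_ascii_grid),
    (lambda t: t.lstrip().startswith("P2"), read_plain_pgm),
]

def open_raster(text):
    """Single entry point: the caller does not need to know the format."""
    for can_open, reader in DRIVERS:
        if can_open(text):
            return reader(text)
    raise ValueError("no driver recognises this data")

grid = "ncols 2\nnrows 2\n1 2\n3 4\n"
pgm = "P2\n2 2\n255\n1 2\n3 4\n"
assert open_raster(grid).cells == open_raster(pgm).cells
```

Supporting a further format then means adding one entry to the registry, while application code that calls `open_raster` stays unchanged.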

GIS Data formats and support question

GDAL Development: Raster formats

Direct funding:
- Atlantis (ENVISAT, MFF, HKV Blobs)
- eCognition Germany (FUJI BAS Format)
- Los Alamos Nat. Labs (FITS)
- OPeNDAP Inc. (OPeNDAP/DODS)
- PeopleSoft (ERDAS LAN)
- Safe Software (USGS SDTS, ISO8211 support)
- Yukon Department of Environment (USGS DEM)

Public formats/Open documents/Reverse engineered:
- ERDAS Imagine (IMG)
- ERMAPPER (ECW)
- ESRI formats (ArcGrid)
- GDAL Virtual Format
- JasPer (JPEG2000); Kakadu (GeoJP2 interface for JPEG2000 = ISO/IEC 15444-1)
- LizardTech (MrSID, JPEG2000)
- NOAA (AVHRR data)

GIS Data formats and support question

OGR Development: Vector formats

Direct funding:
- DM Solutions Group and GoMOOS (SQLite RDBMS, Comma Sep. Values CSV)
- OPeNDAP Inc. (OPeNDAP/DODS)
- Safe Software (FMEObjects)
- SRC, LLC (Oracle Spatial)

Public formats/Open documents/Reverse engineered:
- ESRI (SHAPE, ArcCoverage)
- GML
- IHO S-57
- MapInfo (TAB and MIF/MID)
- Microsoft (ODBC OGR)
- Microstation (DGN)
- MySQL (non-spatial data)
- OGDI Vectors (VMAP)
- OGR Virtual Format
- PostgreSQL/PostGIS
- SDTS
- UK Ordnance Survey (NTF)
- U.S. Census (TIGER)

OGR: OGC Simple Features Conformance, GRASS topological model

GIS formats

Why so many formats? No big problem!

Application specific requirements, which partially contradict each other

high compression rate

small runtime storage requirements

coding without information loss

fast decoding

easy access to pixels

simple algorithm

Hardware-/CPU-independence
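
Some of these trade-offs can be seen by comparing two lossless codings of the same raster row. The Python sketch below is illustrative only and does not show how any particular GIS format stores data: run-length encoding stands in for the "simple algorithm" case, zlib's deflate for a general-purpose compressor, and the sample row and function names are hypothetical.

```python
import zlib

def rle_encode(cells):
    """Run-length encode a sequence into [value, run_length] pairs."""
    runs = []
    for value in cells:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original sequence."""
    cells = []
    for value, length in runs:
        cells.extend([value] * length)
    return cells

# A synthetic raster row with long runs of equal cell values (0..255).
row = [0] * 100 + [5] * 3 + [0] * 20

runs = rle_encode(row)
assert rle_decode(runs) == row               # coding without information loss

packed = zlib.compress(bytes(row))
assert list(zlib.decompress(packed)) == row  # deflate is lossless as well

print(len(row), "cells ->", len(runs), "runs,", len(packed), "deflate bytes")
```

Run-length coding is trivial to implement and decode, but it only pays off on data with long runs; which coding fits best depends on the application, which is one reason why many formats coexist.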

Good software can handle numerous formats.

Software patents and rights of third parties: future traps ?!

GIS formats and Software Patents

How software patents affect GIS users

LZW (Lempel-Ziv-Welch) compression
- Used in many raster formats (e.g. GIF)
- Integrated into GRASS before it became patent-encumbered, later replaced by zlib Deflate
- Unisys started to charge for usage after waiting some years

MrSID (Multi-resolution Seamless Image Database)
- wavelet-based image file format
- three patents covering both the image compression and the on-the-fly
image decompression technology
- GDAL supports MrSID but requires a MrSID SDK license

ECW (ERMAPPER Compressed Wavelets)
- Patent pending
- GPL-released source code available (of patented code?)

JPEG 2000
- Situation not very clear

Public administration must take care
to avoid patent and license traps.

Summary

The personal choice of application software/operating system should not affect
the data exchange

longitudinal and transversal interoperability must be guaranteed

Only documented formats may be used

There is no excuse: start to use interoperable formats today

GIS interoperability is in a better state than office document interoperability

Interoperability awareness needs to be promoted: today and in future

License of this document

Document home:
http://mpa.itc.it/gfoss04/neteler_gfoss04_interoperability2005.pdf



This work is licensed under a Creative Commons License.
http://creativecommons.org/licenses/by-sa/2.0/deed.en

Free GIS and Interoperability, 2004-2005 Markus Neteler
[ OpenOffice SXI file available upon request: neteler at itc it, neteler at osgeo org ]

License details: Attribution-ShareAlike 2.0

You are free:
- to copy, distribute, display, and perform the work
- to make derivative works
- to make commercial use of the work

Under the following conditions:
- Attribution. You must give the original author credit.
- Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work
only under a license identical to this one.

For any reuse or distribution, you must make clear to others the license terms of this work.
Any of these conditions can be waived if you get permission from the copyright holder.
Your fair use and other rights are in no way affected by the above.

Markus Neteler ITC-irst 2004, 2005