Presentation about pyDAP given at PyCon 2007 in Dallas, TX.
Accessing and serving scientific datasets with Python
Dr. Rob De Almeida
The Data Access Protocol
● De facto standard for distributing science data on the internet, used by oceanography, meteorology and climate communities
● Simple HTTP-based protocol with XDR encoding for data transmission
● Supports complex dataset structures
● Model output, satellite images, in-situ data, etc.
Protocol details
● A dataset has different URLs describing it
● http://server/dataset
● http://server/dataset.dds (structure)
● http://server/dataset.das (attributes)
● http://server/dataset.dods (data)
● Client (usually) retrieves metadata from DDS/DAS responses and downloads data from DODS response as necessary
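The URL scheme above can be sketched in a few lines. This is only an illustration written for Python 3; the helper name `response_urls` is hypothetical and is not part of pyDAP.

```python
# Suffixes and their meanings are taken from the slide above.
RESPONSE_SUFFIXES = {
    "dds": "structure",
    "das": "attributes",
    "dods": "data",
}


def response_urls(dataset_url):
    """Return a dict mapping each response suffix to its full URL."""
    base = dataset_url.rstrip("/")
    return {suffix: "%s.%s" % (base, suffix) for suffix in RESPONSE_SUFFIXES}


urls = response_urls("http://server/dataset")
for suffix in ("dds", "das", "dods"):
    print(suffix, RESPONSE_SUFFIXES[suffix], urls[suffix])
```

A client would typically fetch the `dds` and `das` URLs once, then request the `dods` URL only when data is actually needed.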
A simple example
● Dataset with a list “a” of integers from 0 to 9
● Let's also add a few attributes: author, history
● What is the representation of metadata and data?
Dataset Descriptor Structure
Dataset {
Int32 a[a = 10];
} test;
Dataset Attribute Structure
Attributes {
a {
String author "Rob De Almeida";
String history "Created for PyCon 2007";
}
}
DODS response
Dataset {
Int32 a[a = 10];
} test;
Data:
\x00\x00\x00\x0a\x00\x00\x00\x0a
\x00\x00\x00\x00\x00\x00\x00\x01
\x00\x00\x00\x02\x00\x00\x00\x03
\x00\x00\x00\x04\x00\x00\x00\x05
\x00\x00\x00\x06\x00\x00\x00\x07
\x00\x00\x00\x08\x00\x00\x00\x09
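The binary part of the response can be decoded by hand to see what the bytes mean. The sketch below (Python 3, standard library only) assumes the layout visible in the payload above: the array length sent twice as big-endian 32-bit words, followed by one big-endian 32-bit integer per element. The function name `decode_int32_array` is hypothetical.

```python
import struct

# Payload copied from the DODS response above (everything after "Data:").
payload = (
    b"\x00\x00\x00\x0a\x00\x00\x00\x0a"
    b"\x00\x00\x00\x00\x00\x00\x00\x01"
    b"\x00\x00\x00\x02\x00\x00\x00\x03"
    b"\x00\x00\x00\x04\x00\x00\x00\x05"
    b"\x00\x00\x00\x06\x00\x00\x00\x07"
    b"\x00\x00\x00\x08\x00\x00\x00\x09"
)


def decode_int32_array(buf):
    """Decode an Int32 array: two length words, then the elements."""
    # Both length words are big-endian unsigned 32-bit integers.
    count, repeated = struct.unpack(">II", buf[:8])
    if count != repeated:
        raise ValueError("the two length words disagree")
    # Each element is a big-endian signed 32-bit integer.
    return list(struct.unpack(">%di" % count, buf[8:8 + 4 * count]))


values = decode_int32_array(payload)
print(values)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```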
Using pyDAP as a client
● The client retrieves and parses the metadata (DAS/DDS), building a dataset object with all the variables that can be introspected
● Data is downloaded on the fly when required
● Uses httplib2 and a custom-made xdrlib based on numpy or array
Example usage
>>> from dap.client import open
>>> dataset = open('http://test.pydap.org/coads.nc', verbose=True)
http://test.pydap.org/coads.nc.dds
http://test.pydap.org/coads.nc.das
>>> print dataset.keys()
['UWND', 'WSPD', 'SST', 'VWND', 'SLP', 'AIRT', 'SPEH', 'COADSX', 'COADSY', 'TIME']
Introspecting the dataset
>>> time = dataset['TIME']
>>> print time.type, time.shape, time.dimensions
Float64 (12,) ('TIME',)
>>> print time.units
hour since 0000-01-01 00:00:00
Retrieving data
>>> print time[:]
http://test.pydap.org/coads.nc.dods?TIME[0:1:11]
[ 366. 1096.485 1826.97 2557.455 3287.94 4018.425 4748.91 5479.395 6209.88 6940.365 7670.85 8401.335]
>>> print time[0]
http://test.pydap.org/coads.nc.dods?TIME[0:1:0]
[ 366.]
>>> print time[-2:]
http://test.pydap.org/coads.nc.dods?TIME[10:1:11]
[ 7670.85 8401.335]
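The requests logged above suggest how a Python index maps to the `[start:step:stop]` suffix in the URL, with an inclusive stop. The following is a minimal Python 3 sketch of that mapping, not pyDAP's own code; the function name `hyperslab` is hypothetical and only positive steps are handled.

```python
def hyperslab(index, length):
    """Translate a Python index or slice into a [start:step:stop] string."""
    if isinstance(index, slice):
        start, stop, step = index.indices(length)
        if step < 1:
            raise ValueError("only positive steps are handled in this sketch")
        if stop <= start:
            raise ValueError("empty selection")
        last = stop - 1  # the stop in the URL is inclusive
    else:
        start = index + length if index < 0 else index
        if not 0 <= start < length:
            raise IndexError("index out of range")
        step, last = 1, start
    return "[%d:%d:%d]" % (start, step, last)


# TIME has shape (12,) in the example dataset.
print("TIME" + hyperslab(slice(None), 12))      # TIME[0:1:11]
print("TIME" + hyperslab(0, 12))                # TIME[0:1:0]
print("TIME" + hyperslab(slice(-2, None), 12))  # TIME[10:1:11]
```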
Working with sequential data
Dataset {
Sequence {
Int32 id;
Float64 lat;
Float64 lon;
} test;
} test%2Ecsv;
http://test.pydap.org/test.csv.dds
Retrieving data
>>> from dap.client import open
>>> dataset = open('http://test.pydap.org/test.csv', verbose=True)
http://test.pydap.org/test.csv.dds
http://test.pydap.org/test.csv.das
>>> seq = dataset['test']
>>> print seq['lat'][:]
http://test.pydap.org/test.csv.dods?test.lat
[10.1, 10.199999999999999, 10.300000000000001, 10.4, 10.5]
Iterating over sequential data
>>> for struct in seq:
... print struct['lat'].data, struct['lon'].data
...
http://test.pydap.org/test.csv.dods?test.id
http://test.pydap.org/test.csv.dods?test.lat
http://test.pydap.org/test.csv.dods?test.lon
10.1 103.0
10.2 93.0
10.3 83.0
10.4 73.0
10.5 63.0
Filtering sequences (sure way)
>>> fseq = seq.filter('%s<100' % seq.lon.id)
>>> for struct in fseq:
... print struct['lat'].data, struct['lon'].data
...
http://test.pydap.org/test.csv.dods?test.id&test.lon<100
http://test.pydap.org/test.csv.dods?test.lat&test.lon<100
http://test.pydap.org/test.csv.dods?test.lon&test.lon<100
10.2 93.0
10.3 83.0
10.4 73.0
10.5 63.0
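The request URLs printed above follow a simple pattern: the requested variable comes first, then each filter is appended with `&`. A minimal Python 3 sketch of that pattern is shown below; `sequence_url` is a hypothetical helper, and the query is left unescaped to match the URLs as printed in the session.

```python
def sequence_url(dataset_url, variable, filters=()):
    """Build a .dods URL that requests one variable with optional filters."""
    query = "&".join([variable] + list(filters))
    return "%s.dods?%s" % (dataset_url, query)


base = "http://test.pydap.org/test.csv"
for name in ("test.id", "test.lat", "test.lon"):
    print(sequence_url(base, name, ["test.lon<100"]))
```

Because the filter travels in the URL, the server applies it and only the matching records are sent to the client.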
Filtering sequences (fun way!)
>>> fseq = (struct for struct in seq if struct['lon'] < 100)
>>> for struct in fseq:
... print struct['lat'].data, struct['lon'].data
...
http://test.pydap.org/test.csv.dods?test.id&test.lon<100
http://test.pydap.org/test.csv.dods?test.lat&test.lon<100
http://test.pydap.org/test.csv.dods?test.lon&test.lon<100
10.2 93.0
10.3 83.0
10.4 73.0
10.5 63.0
Server
● pyDAP comes with a WSGI app that works as a DAP server
● Server is just a thin layer between plugins that handle data formats (netCDF, HDF5, SQL, etc.) and responses (DAS, DDS, DODS, HTML, KML, WMS, etc.)
● Can be deployed with a Paste Script template:
● paster create -t dap_server myserver
● paster serve myserver/server.ini
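To illustrate what "a WSGI app that works as a DAP server" can mean, here is a toy Python 3 application that picks a response from the URL extension. It is not pyDAP's server: the dataset is the hard-coded example from the earlier slides, and the names `app`, `fetch` and `RESPONSES` are hypothetical.

```python
# Metadata of the simple example dataset shown earlier in the slides.
DDS = "Dataset {\n    Int32 a[a = 10];\n} test;\n"
DAS = (
    "Attributes {\n"
    "    a {\n"
    '        String author "Rob De Almeida";\n'
    '        String history "Created for PyCon 2007";\n'
    "    }\n"
    "}\n"
)

# Map each URL extension to the text of the matching response.
RESPONSES = {"dds": DDS, "das": DAS}


def app(environ, start_response):
    """Toy WSGI app: serve /test.dds and /test.das, 404 otherwise."""
    path = environ.get("PATH_INFO", "")
    name, _, extension = path.rpartition(".")
    body = RESPONSES.get(extension) if name == "/test" else None
    if body is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body.encode("ascii")]


def fetch(path):
    """Call the app directly, without starting an HTTP server."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], body.decode("ascii")


status, text = fetch("/test.dds")
print(status)
print(text)
```

In a real server the hard-coded strings would be replaced by a plugin that reads the data and a response that serializes it, which is the split the slide describes.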
Plugins and responses
http://localhost:8080/file.nc.das
Plugins
● Convert data from different formats to pyDAP types
● Plugins for netCDF, CSV, Matlab 4/5, HDF5, GrADS grib, GDAL, DB API 2, grib2
● EasyInstall (entry point dap.plugin):
● easy_install dap.plugins.netcdf
Responses
● Convert from pyDAP types to something else
● “Official” responses: DAS, DDS, DODS
● Generate data and metadata from the dataset created by the plugins
● Extra responses can be installed using EasyInstall (entry point dap.response)
ASCII response
Dataset {
Sequence {
Int32 id;
Float64 lat;
Float64 lon;
} test;
} test%2Ecsv;
---------------------------------------------
test.id, test.lat, test.lon
1, 10.1, 103
2, 10.2, 93
3, 10.3, 83
4, 10.4, 73
5, 10.5, 63
http://test.pydap.org/test.csv.ascii
HTML response
● Generates an HTML form to download data
● Redirects user to ASCII response
● Useful for users without a DAP client
Example HTML response
JSON response
{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset",
"test": {"attributes": {}, "type": "Sequence", "id": {"attributes": {}, "type": "Int32"}, "lat": {"attributes": {}, "type": "Float64"}, "lon": {"attributes": {}, "type": "Float64"}}}}
http://test.pydap.org/test.csv.json
JSON response with data
{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset",
"test": {"attributes": {}, "type": "Sequence", "data": [[1, 10.1, 103.0], [2, 10.2, 93.0], [3, 10.3, 83.0], [4, 10.4, 73.0], [5, 10.5, 63.0]], "id": {"attributes": {}, "type": "Int32"}, "lat": {"attributes": {}, "type": "Float64"}, "lon": {"attributes": {}, "type": "Float64"}}}}
http://test.pydap.org/test.csv.json?output_data=1
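A response like the one above can be consumed with nothing but a JSON parser. The Python 3 sketch below turns the rows into named columns. It assumes that the columns in "data" follow the order in which the child variables appear in the JSON object, which matches this example but is an assumption about the format.

```python
import json

# Response text copied from the slide above.
response = '''{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset",
"test": {"attributes": {}, "type": "Sequence", "data": [[1, 10.1, 103.0], [2, 10.2, 93.0], [3, 10.3, 83.0], [4, 10.4, 73.0], [5, 10.5, 63.0]], "id": {"attributes": {}, "type": "Int32"}, "lat": {"attributes": {}, "type": "Float64"}, "lon": {"attributes": {}, "type": "Float64"}}}}'''

dataset = json.loads(response)["test%2Ecsv"]
sequence = dataset["test"]

# Child variables are the entries that are objects carrying their own "type".
names = [
    key for key, value in sequence.items()
    if isinstance(value, dict) and "type" in value
]

# Assumption: column i of each data row belongs to the i-th child variable.
columns = {
    name: [row[i] for row in sequence["data"]]
    for i, name in enumerate(names)
}

print(names)           # ['id', 'lat', 'lon']
print(columns["lat"])  # [10.1, 10.2, 10.3, 10.4, 10.5]
```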
WMS response
● Returns maps (images) from requested variables and regions
● Works with geo-referenced grids and sequences
● Layers can be composed together
● Data can be constrained:
● /coads.nc.wms?SST // annual mean
● /coads.nc.wms?SST[0] // January
WMS example request
http://localhost:8080/netcdf/coads.nc.wms?LAYERS=SST&WIDTH=512
KML response
● Generates XML file using the Keyhole Markup Language, pointing to the WMS response
● Nice and simple interface for quickly visualizing data
Future
● pyDAP 2.3 almost ready
● Dapper compliance
● Faster XDR encoding/decoding
● Initial support for DDX response and parser
● Build a rich web interface (AJAX) based on JSON + WMS + KML responses
● Not only for pyDAP, but also for other OPeNDAP servers, using pyDAP as a proxy
Acknowledgments
● OPeNDAP for all the support
● PSF for the financial support to be here
● Everybody who submitted bugs (bonus points for submitting patches!)