Accessing and serving scientific datasets with Python

Dr. Rob De Almeida

PyCon 2007

Presentation about pyDAP given at PyCon 2007 in Dallas, TX.

Page 1: PyCon 2007

Accessing and serving scientific datasets with Python

Dr. Rob De Almeida

Page 2: PyCon 2007

The Data Access Protocol

● De facto standard for distributing science data on the internet, used by oceanography, meteorology and climate communities

● Simple HTTP-based protocol with XDR encoding for data transmission

● Supports complex dataset structures
  ● Model output, satellite images, in-situ data, etc.

Page 3: PyCon 2007

Protocol details

● A dataset has different URLs describing it
  ● http://server/dataset
  ● http://server/dataset.dds (structure)
  ● http://server/dataset.das (attributes)
  ● http://server/dataset.dods (data)

● Client (usually) retrieves metadata from DDS/DAS responses and downloads data from DODS response as necessary
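
As a rough illustration of the URL scheme above, the following Python 3 sketch derives the three response URLs from a dataset URL. The helper name `response_urls` is hypothetical and is not part of pyDAP.

```python
def response_urls(dataset_url):
    """Return the DDS, DAS and DODS URLs for a DAP dataset URL."""
    base = dataset_url.rstrip('/')
    return {
        'dds': base + '.dds',    # structure
        'das': base + '.das',    # attributes
        'dods': base + '.dods',  # data
    }


urls = response_urls('http://server/dataset')
for name in ('dds', 'das', 'dods'):
    print(name, urls[name])
```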

Page 4: PyCon 2007

A simple example

● Dataset with a list “a” of integers from 0 to 9

● Let's also add a few attributes: author, history

● What is the representation of metadata and data?

Page 5: PyCon 2007

Dataset Descriptor Structure

Dataset {

Int32 a[a = 10];

} test;

Page 6: PyCon 2007

Dataset Attribute Structure

Attributes {

a {

String author "Rob De Almeida";

String history "Created for PyCon 2007";

}

}

Page 7: PyCon 2007

DODS response

Dataset {

Int32 a[a = 10];

} test;

Data:

\x00\x00\x00\x0a\x00\x00\x00\x0a

\x00\x00\x00\x00\x00\x00\x00\x01

\x00\x00\x00\x02\x00\x00\x00\x03

\x00\x00\x00\x04\x00\x00\x00\x05

\x00\x00\x00\x06\x00\x00\x00\x07

\x00\x00\x00\x08\x00\x00\x00\x09
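
The binary part of the response above can be read as big-endian 32-bit integers: the array length (10) appears twice, followed by the ten values. A minimal Python 3 sketch of decoding it with the standard `struct` module follows; `decode_int32_array` is an illustrative name, not pyDAP's own decoder, and the sketch handles only this single Int32 array.

```python
import struct

# Rebuild the payload shown above: the length twice, then the values 0..9,
# each as a big-endian signed 32-bit integer.
payload = b''.join(struct.pack('>i', n) for n in [10, 10] + list(range(10)))


def decode_int32_array(data):
    """Decode one XDR-encoded Int32 array as it appears after 'Data:'."""
    length, repeated = struct.unpack('>ii', data[:8])
    if length != repeated:
        raise ValueError('the two length fields disagree')
    return list(struct.unpack('>%di' % length, data[8:8 + 4 * length]))


values = decode_int32_array(payload)
print(values)
```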

Page 8: PyCon 2007

Using pyDAP as a client

● The client retrieves and parses the metadata (DAS/DDS), building a dataset object with all the variables that can be introspected

● Data is downloaded on the fly when required

● Uses httplib2 and a custom-made xdrlib based on numpy or array

Page 9: PyCon 2007

Example usage

>>> from dap.client import open

>>> dataset = open('http://test.pydap.org/coads.nc', verbose=True)

http://test.pydap.org/coads.nc.dds

http://test.pydap.org/coads.nc.das

>>> print dataset.keys()

['UWND', 'WSPD', 'SST', 'VWND', 'SLP', 'AIRT', 'SPEH', 'COADSX', 'COADSY', 'TIME']

Page 10: PyCon 2007

Introspecting the dataset

>>> time = dataset['TIME']

>>> print time.type, time.shape, time.dimensions

Float64 (12,) ('TIME',)

>>> print time.units

hour since 0000-01-01 00:00:00

Page 11: PyCon 2007

Retrieving data

>>> print time[:]

http://test.pydap.org/coads.nc.dods?TIME[0:1:11]

[ 366. 1096.485 1826.97 2557.455 3287.94 4018.425 4748.91 5479.395 6209.88 6940.365 7670.85 8401.335]

>>> print time[0]

http://test.pydap.org/coads.nc.dods?TIME[0:1:0]

[ 366.]

>>> print time[-2:]

http://test.pydap.org/coads.nc.dods?TIME[10:1:11]

[ 7670.85 8401.335]
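
The URLs printed above show each Python index being translated into a DAP hyperslab of the form `[start:step:stop]`, where the stop is inclusive. A Python 3 sketch of that translation is given below; `hyperslab` is a hypothetical helper written for this illustration and only handles non-empty forward slices, so it should not be read as pyDAP's internal code.

```python
def hyperslab(index, length):
    """Translate a Python index or slice into a DAP '[start:step:stop]'."""
    if isinstance(index, int):
        if index < 0:
            index += length
        index = slice(index, index + 1)
    start, stop, step = index.indices(length)
    if step < 1 or stop <= start:
        raise ValueError('only non-empty forward slices are handled here')
    # DAP uses an inclusive stop, Python an exclusive one.
    return '[%d:%d:%d]' % (start, step, stop - 1)


base = 'http://test.pydap.org/coads.nc.dods?TIME'
for index in (slice(None), 0, slice(-2, None)):
    print(base + hyperslab(index, 12))
```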

Page 12: PyCon 2007

Working with sequential data

Dataset {

Sequence {

Int32 id;

Float64 lat;

Float64 lon;

} test;

} test%2Ecsv;

http://test.pydap.org/test.csv.dds

Page 13: PyCon 2007

Retrieving data

>>> from dap.client import open

>>> dataset = open('http://test.pydap.org/test.csv', verbose=True)

http://test.pydap.org/test.csv.dds

http://test.pydap.org/test.csv.das

>>> seq = dataset['test']

>>> print seq['lat'][:]

http://test.pydap.org/test.csv.dods?test.lat

[10.1, 10.199999999999999, 10.300000000000001, 10.4, 10.5]

Page 14: PyCon 2007

Iterating over sequential data

>>> for struct in seq:

... print struct['lat'].data, struct['lon'].data

...

http://test.pydap.org/test.csv.dods?test.id

http://test.pydap.org/test.csv.dods?test.lat

http://test.pydap.org/test.csv.dods?test.lon

10.1 103.0

10.2 93.0

10.3 83.0

10.4 73.0

10.5 63.0

Page 15: PyCon 2007

Filtering sequences (sure way)

>>> fseq = seq.filter('%s<100' % seq.lon.id)

>>> for struct in fseq:

... print struct['lat'].data, struct['lon'].data

...

http://test.pydap.org/test.csv.dods?test.id&test.lon<100

http://test.pydap.org/test.csv.dods?test.lat&test.lon<100

http://test.pydap.org/test.csv.dods?test.lon&test.lon<100

10.2 93.0

10.3 83.0

10.4 73.0

10.5 63.0

Page 16: PyCon 2007

Filtering sequences (fun way!)

>>> fseq = (struct for struct in seq if struct['lon'] < 100)

>>> for struct in fseq:

... print struct['lat'].data, struct['lon'].data

...

http://test.pydap.org/test.csv.dods?test.id&test.lon<100

http://test.pydap.org/test.csv.dods?test.lat&test.lon<100

http://test.pydap.org/test.csv.dods?test.lon&test.lon<100

10.2 93.0

10.3 83.0

10.4 73.0

10.5 63.0
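
Both filtering styles produce the same requests: the variable to fetch, followed by the selection `test.lon<100`, joined with `&`. The Python 3 sketch below assembles such a URL by hand. The function name `constrained_url` is an assumption made for this example, and the string is left unescaped to match the URLs printed on the slides.

```python
def constrained_url(dataset_url, variable, selections=()):
    """Build a .dods URL requesting one variable, filtered by selections."""
    parts = [variable] + list(selections)
    return dataset_url + '.dods?' + '&'.join(parts)


for name in ('test.id', 'test.lat', 'test.lon'):
    print(constrained_url('http://test.pydap.org/test.csv', name,
                          ['test.lon<100']))
```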

Page 17: PyCon 2007

Server

● pyDAP comes with a WSGI app that works as a DAP server

● Server is just a thin layer between plugins that handle data formats (netCDF, HDF5, SQL, etc.) and responses (DAS, DDS, DODS, HTML, KML, WMS, etc.)

● Can be deployed with Paster Script template:

  ● paster create -t dap_server myserver
  ● paster serve myserver/server.ini
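
The "thin layer" idea can be sketched as a tiny WSGI application that splits the extension off the request path, asks a plugin for the dataset and hands it to the matching response. Everything below is a toy written in Python 3 for illustration: `list_plugin`, `dds_response`, `RESPONSES` and the `text/plain` content type are assumptions, and pyDAP itself discovers plugins and responses through entry points rather than a hard-coded dictionary.

```python
def list_plugin(path):
    """Toy 'plugin': returns the example dataset regardless of the path."""
    return {'name': 'test', 'variables': {'a': list(range(10))}}


def dds_response(dataset):
    """Toy 'response': render a DDS, treating every variable as Int32."""
    lines = ['Dataset {']
    for name, values in dataset['variables'].items():
        lines.append('    Int32 %s[%s = %d];' % (name, name, len(values)))
    lines.append('} %s;' % dataset['name'])
    return '\n'.join(lines) + '\n'


RESPONSES = {'dds': dds_response}  # hypothetical registry


def app(environ, start_response):
    """Minimal WSGI app: split off the extension, pick a response."""
    path, _, ext = environ.get('PATH_INFO', '').rpartition('.')
    response = RESPONSES.get(ext)
    if response is None:
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return [b'Unknown response\n']
    body = response(list_plugin(path)).encode('ascii')
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [body]


status = []
body = app({'PATH_INFO': '/test.dds'}, lambda s, h: status.append(s))
print(status[0])
print(body[0].decode('ascii'))
```

Adding another output format in this sketch means registering one more function in `RESPONSES`; the plugin side stays unchanged.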

Page 18: PyCon 2007

Plugins and responses

Page 19: PyCon 2007

Plugins and responses

http://localhost:8080/file.nc.das

Page 20: PyCon 2007

Plugins

● Convert data from different formats to pyDAP types

● Plugins for netCDF, CSV, Matlab 4/5, HDF5, GrADS grib, GDAL, DB API 2, grib2

● EasyInstall (entry point dap.plugin):
  ● easy_install dap.plugins.netcdf

Page 21: PyCon 2007

Responses

● Convert from pyDAP types to something else

● “Official” responses: DAS, DDS, DODS

● Generate data and metadata from the dataset created by the plugins

● Extra responses can be installed using EasyInstall (entry point dap.response)

Page 22: PyCon 2007

ASCII response

Dataset {
    Sequence {
        Int32 id;
        Float64 lat;
        Float64 lon;
    } test;
} test%2Ecsv;
---------------------------------------------
test.id, test.lat, test.lon
1, 10.1, 103
2, 10.2, 93
3, 10.3, 83
4, 10.4, 73
5, 10.5, 63

http://test.pydap.org/test.csv.ascii

Page 23: PyCon 2007

HTML response

● Generates an HTML form to download data

● Redirects user to ASCII response

● Useful for users without a DAP client

Page 24: PyCon 2007

Example HTML response

Page 25: PyCon 2007

JSON response

{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset",

"test": {"attributes": {}, "type": "Sequence", "id": {"attributes": {}, "type": "Int32"}, "lat": {"attributes": {}, "type": "Float64"}, "lon": {"attributes": {}, "type": "Float64"}}}}

http://test.pydap.org/test.csv.json

Page 26: PyCon 2007

JSON response with data

{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset",

"test": {"attributes": {}, "type": "Sequence", "data": [[1, 10.1, 103.0], [2, 10.2, 93.0], [3, 10.3, 83.0], [4, 10.4, 73.0], [5, 10.5, 63.0]], "id": {"attributes": {}, "type": "Int32"}, "lat": {"attributes": {}, "type": "Float64"}, "lon": {"attributes": {}, "type": "Float64"}}}}

http://test.pydap.org/test.csv.json?output_data=1
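
A client without DAP support can consume this response with any JSON parser. The Python 3 sketch below loads the body shown above and turns each row into a dictionary. It assumes that the child variables appear in the JSON object in the same order as the columns in `data`, which holds for this example but is not something the slides state.

```python
import json

# Body of the JSON response shown above (output_data=1).
text = '''{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset",
"test": {"attributes": {}, "type": "Sequence", "data": [[1, 10.1, 103.0], [2, 10.2, 93.0], [3, 10.3, 83.0], [4, 10.4, 73.0], [5, 10.5, 63.0]], "id": {"attributes": {}, "type": "Int32"}, "lat": {"attributes": {}, "type": "Float64"}, "lon": {"attributes": {}, "type": "Float64"}}}}'''

sequence = json.loads(text)['test%2Ecsv']['test']

# Child variables are the keys that are not bookkeeping fields.
# Assumption: their order matches the column order of "data".
names = [key for key in sequence if key not in ('attributes', 'type', 'data')]
rows = [dict(zip(names, row)) for row in sequence['data']]

# Same filter as in the sequence examples: keep rows with lon < 100.
for row in rows:
    if row['lon'] < 100:
        print(row['lat'], row['lon'])
```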

Page 27: PyCon 2007

WMS response

● Returns maps (images) from requested variables and regions

● Works with geo-referenced grids and sequences

● Layers can be composed together

● Data can be constrained:
  ● /coads.nc.wms?SST // annual mean
  ● /coads.nc.wms?SST[0] // January

Page 28: PyCon 2007

WMS example request

http://localhost:8080/netcdf/coads.nc.wms?LAYERS=SST&WIDTH=512
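
A request like the one above can be assembled with the standard library. This Python 3 sketch uses only the two parameters visible on the slide (`LAYERS` and `WIDTH`); the helper name `wms_url` is hypothetical, and a complete WMS request would normally carry further parameters that the slide does not show.

```python
from urllib.parse import urlencode


def wms_url(dataset_url, **params):
    """Append '.wms' and a query string to a dataset URL."""
    return dataset_url + '.wms?' + urlencode(params)


print(wms_url('http://localhost:8080/netcdf/coads.nc',
              LAYERS='SST', WIDTH=512))
```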

Page 29: PyCon 2007

KML response

● Generates XML file using the Keyhole Markup Language, pointing to the WMS response

● Nice and simple interface for quickly visualizing data

Page 30: PyCon 2007
Page 31: PyCon 2007
Page 32: PyCon 2007

Future

● pyDAP 2.3 almost ready
  ● Dapper compliance
  ● Faster XDR encoding/decoding
  ● Initial support for DDX response and parser

● Build a rich web interface (AJAX) based on JSON + WMS + KML responses

● Not only to pyDAP, but to other OPeNDAP servers using pyDAP as a proxy

Page 33: PyCon 2007

Acknowledgments

● OPeNDAP for all the support

● PSF for the financial support to be here

● Everybody who submitted bugs (bonus points for submitting patches!)