Web delivery of giant climate data sets to facilitate open science

Web delivery of giant climate data sets to facilitate open science

James Hiebert, Pacific Climate Impacts Consortium, University of Victoria, Victoria, BC, Canada

FOSS4G NA, May 22, 2013

Introduction

How to serve...

high-resolution, downscaled, climate data on the web?

My name is James Hiebert from the Pacific Climate Impacts Consortium in British Columbia, and I'm going to speak about my experience putting high-resolution, downscaled, climate data on the web. Give me a few slides to unpack that statement back to front.

Pacific Climate Impacts Consortium
(PCIC)

Regional climate services provider

Non-profit hosted at the University of Victoria

Fill a niche between pure research and applied science

First of all I work for the Pacific Climate Impacts Consortium (PCIC), a regional climate service provider in British Columbia hosted at the University of Victoria. We are a publicly funded organization whose mandate is to conduct and facilitate regional climate research and deliver regional climate services.

Consortium Partners

As such we work directly with local and provincial governments, providing information on the physical science of climate change in support of the development of adaptation strategies.

Consortium Partners

Our stakeholders use our climate projections to answer a wide array of pertinent questions. Examples include whether future river flows can support hydro power...

Consortium Partners

whether future storm intensity will necessitate larger culverts, storm drains, or bridges...

Consortium Partners

to what degree sea level rise might inundate our homes and farmland (this photo was taken just a couple minutes' walk from my office)...

Consortium Partners

or whether our forests might become more susceptible to fire...

Consortium Partners

or outbreaks of disease.

Our stakeholders are potentially making policy and engineering decisions for incredibly expensive infrastructure based on the results of our impacts models. We have a responsibility to openly provide the results of each phase of the climate modeling pipeline. This provides a higher degree of transparency and facilitates greater stakeholder engagement with a tighter feedback loop. Think Agile development, but for science.

High-resolution

How to serve...

high-resolution, downscaled, climate data on the web?

Answering these kinds of questions requires information at a very small scale, which brings us to high resolution. As applied to climate data, the term high-resolution is a little difficult to define. For impacts models, our users typically require landscape-level information or better. So high is a relative term that usually means as high as you can get. But allow me a brief interlude to first explain where climate data comes from and how we get to landscape-level information.

Global Climate Models (GCMs)

IPCC AR4 Report, 2007

Modeling centers around the world develop and run what are called Global Climate Models (GCMs). GCMs are derived from fundamental physical laws and include a variety of features that affect the climate system.

GCM Resolution

IPCC AR4 Report, 2007

They run at a relatively coarse scale, with one pixel representing a few hundred kilometers square. The output from these GCMs can then be transformed to finer resolutions, sometimes using statistical relationships, sometimes by running a model again on a smaller domain.

Global to Regional Scale

This transformation step is a process referred to as downscaling. The result of this downscaling step is what is used to drive impacts models and that is our target data for distribution.

Downscaling

How to serve...

high-resolution, downscaled, climate data on the web?

Downscaling

GCMs (100-500 km) → RCMs (15-50 km) → Downscaling (1-4 km) → Impacts Models

This is a very simplified flow diagram of the climate data pipeline, but to be clear, a lot of sophisticated physics, climate science, statistics, and machine learning goes into each of these steps, each of which has its own set of scientific questions to be answered and associated uncertainties.

Moving on

How to serve...

high-resolution, downscaled, climate data on the web?

So hopefully I have explained these parts well enough for you to get by and we can move on to putting the data on the web.

Technical Challenges

Data volume

Irregular grids and projections

Large dimensionality

It turns out that it's hard to do! There are a variety of technical challenges that are worth mentioning.

Challenges: Data Volume

n emission scenarios × m GCMs × o RCMs × p downscaling methods × variables × space × time

Big data

If you were paying attention through that last slide, you may have thought: hey, that sounds like a lot of moving parts. If you consider that climate scientists are modeling some number of future scenarios for human emissions, and multiply that by all of the different global climate models, by all of the different regional climate models, by all of the different types of downscaling, by some number of measured quantities, by time, by space... all of a sudden we're talking about a lot of data.
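
To make that multiplication concrete, here is a minimal back-of-the-envelope sketch in Python. Every count and grid size below is an illustrative assumption, not PCIC's actual archive.

# Back-of-the-envelope estimate of a downscaled climate archive.
# All counts below are illustrative assumptions, not PCIC's real numbers.
emission_scenarios = 3        # e.g. low / medium / high emissions pathways
gcms = 10                     # global climate models
downscaling_methods = 2
variables = 3                 # e.g. tmin, tmax, precipitation
grid_cells = 1000 * 1500      # a 1-4 km grid over a large region
timesteps = 365 * 150         # daily values, 1950-2100
bytes_per_value = 4           # 32-bit float

total_bytes = (emission_scenarios * gcms * downscaling_methods *
               variables * grid_cells * timesteps * bytes_per_value)
print("roughly %.1f TB" % (total_bytes / 1e12))

Even with these modest counts, the product lands in the tens of terabytes, which is why data delivery becomes a problem in its own right.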

You see, unlike observations, we're not limited by the sensors or satellites that we can afford to deploy. We can just create data out of thin air... or at least out of a model that is churning away at simulating the earth and the ocean.

We're not even limited by what has happened in the past. We are projecting the future. And not just one future, but many, many realizations of the future. As modeled by many people and many modeling centers.

Challenges: Grids and Projections

Additional challenges include that model grids are often irregular or curvilinear, or in projections that aren't supported by the FOSS tools we want to use. And in simulating global climate, the poles and high latitudes are often very important to the climate system. But those same places are often overlooked by popular, well-supported projections like Web Mercator, since few people live there.
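
To illustrate what "curvilinear" means in practice, here is a minimal sketch of reading such a grid: the file carries full 2-D latitude and longitude arrays rather than two simple 1-D axes, so every cell's location has to be looked up explicitly. The file name and variable names are hypothetical, and netCDF4 is used here only as one common reader, not necessarily part of PCIC's stack.

# Minimal sketch: a curvilinear (e.g. rotated-pole) model grid stores
# 2-D latitude/longitude arrays instead of regular 1-D axes.
# File and variable names below are hypothetical.
from netCDF4 import Dataset

nc = Dataset("rcm_output.nc")        # hypothetical RCM output file
lat = nc.variables["lat"][:]         # shape (y, x) -- not a simple axis
lon = nc.variables["lon"][:]         # shape (y, x)
pr = nc.variables["pr"][0, :, :]     # one timestep of precipitation

print(lat.shape, lon.shape, pr.shape)
nc.close()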

Challenges: Large dimensionality

Also, the geospatial component is not the only dimension of the data. Maps are usually what we use in this community as our final outputs, but with all the different permutations of our data, that would end up being tens of thousands of maps, or billions if you consider the full time dimension.

Demo

With those challenges in mind, let's get back to the web, and I'm going to go through a demo of what we developed to accommodate data delivery to our users. I'll go through this demonstration as a hypothetical user of our climate services. This user, Alice, is an engineer with BC's Ministry of Transportation. She is working in one of BC's remote coastal communities, attached to the outside world by one ferry and a single road, and is assessing its vulnerability to extreme precipitation and its effects on roads, culverts, bridges, and other critical infrastructure.

Alice wants plausible future climate scenarios for the watersheds around the highway. We give the user a map to select their region of interest with an overlay of the climate raster to see where information is available.

On the right-hand side, we give our user a tree of various scenarios to select from, corresponding to different greenhouse gas concentration pathways, different sets of GCMs, and different downscaling methods.

There's a time selector in case they're only interested in a subset of the future. Perhaps they only desire an analysis of the past plus the next forty years, to correspond to the projected lifespan of their bridge.

Then the user just has to select an output format and hit download, and the data starts streaming.

Now, the datasets that a user could download are potentially very large. Each scenario of the full spatiotemporal domain is around 150 GB, which isn't ridiculous to serve up as a static file over HTTP. However, as soon as you want to write a web app around it, allow dynamic responses, subset requests, etc., all of a sudden you have a lot of I/O-bound computations to make before your HTTP response goes back, which, ideally, should happen in less than one second. So, while we're waiting for the data to download, I'll go through the FOSS components that we used off the shelf and some of the more creative things that we had to do to make all this work.

FOSS Components

Off the shelf

PostgreSQL

Python/h5py

TileCache

ncWMS

OpenLayers

By PCIC (or heavily modified)

Pydap-3.2

Python/pupynere

The metadata database indexing and management

The whole UI

To give a sense of the breakdown between how much plug-and-play we were able to do versus how much custom development work we did, here is a comparison. Everything in the left column we have been able to use out of the box to meet our needs. On the right, you'll see a couple of Python packages to which we made heavy modifications, and then there are our data management schemas and the user interface.

FOSS Components

OpenStreetMap · OpenLayers · from ncWMS · POST to PyDAP

Placing those components on the UI... the front end is obviously OpenLayers, with a basemap based on OpenStreetMap data, rendered by Mapnik and served through TileCache. The climate overlay layer is served through ncWMS, an aptly named program that takes NetCDF input files and serves them as WMS layers. The data service layer is an OPeNDAP server named PyDAP, which has been heavily modified by us and will eventually be released as open source.
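
To give a feel for what that data service layer looks like from a client's point of view, here is a minimal sketch using the standard Pydap client API. The server URL, variable name, and slice indices are hypothetical; the point is only that an OPeNDAP client can ask for just the slab it needs.

# Minimal sketch of pulling a subset from an OPeNDAP endpoint with the
# stock Pydap client. The URL and variable name below are hypothetical.
from pydap.client import open_url

dataset = open_url("http://example.org/data/downscaled_pr.nc")
pr = dataset["pr"]            # lazy proxy; nothing downloaded yet
print(pr.shape)               # dimensions of the remote variable
data = pr[0:10, 0:50, 0:50]   # only this slab is fetched over the network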

Big data, big RAM, BadRequest.
Oh my!

One of the technical problems that we ran up against was that all of the available OPeNDAP data servers load their responses entirely into RAM before sending them out. So if you want to serve up large data sets, the size of your response is limited by your available RAM divided by the number of concurrent responses that you are prepared to serve. If you try to make a request to, say, a THREDDS OPeNDAP server that's larger than the JVM's allocated memory, the user will just get back a BadRequest error.

For some applications this may be fine, or even desirable, but for the purposes of serving large data sets, the network pipe is usually the bottleneck. Rather than annoy and frustrate the user by forcing them to carve up their data requests to be arbitrarily small, we wanted to allow as large a request as the users were prepared to accept.

Hurry up and wait

One of my big pet peeves is when you go to a website to get data, you hone in on what you want, the anticipation builds, you're ready for your data, and then you get some dialog box suggesting that "We'll e-mail you when your data is ready!" It's the biggest letdown to focus in on this dataset that you want and need and then have to turn your attention to something else while the server puts you, the user, on indefinite hold until it's ready.

I don't intend any disrespect to these websites which I have shown, but at PCIC, one of our key requirements was to avoid this tragic UX mistake and give the user an instant response and begin streaming the data immediately.

Generators: 70's tech that works today!

a function which yields execution rather than returning

yields values one at a time, on-demand

low memory footprint

faster: less calling overhead

elegant!

Enter generators and coroutines. Generators are a programming construct where a function, rather than returning, can yield execution and return values one at a time, on demand. They have the performance advantage of maintaining a low memory footprint: if you want to return something large, you don't have to do so all at once. They also tend to be slightly faster, because you avoid a lot of the calling overhead of stack manipulation.

Generators have been around for a good thirty-five years, but have been experiencing a bit of a renaissance lately. If one programs in Python, they are extremely easy to use, and with the advent of big data applications, they have a lot of utility.

Generator Example

from itertools import islice

def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# print the first 10 values of the Fibonacci sequence
for x in islice(fibonacci(), 10):
    print(x)

For those who aren't familiar, here's a quick example to understand generators. Generating a Fibonacci sequence is kind of the quintessential toy example. The generator function, fibonacci(), is defined at the top. You'll notice that it's an infinite loop, because the sequence is, by definition, infinite. But rather than building up the values in memory, it just has a simple and elegant yield statement right inside the loop. The calling loop down below actually pulls items from the function, one at a time, and then does whatever it needs to do with them. It's fast, efficient, and actually fairly elegant, readable code, too.

So you can see, for something like a web application serving big datasets, this is perfect, because we can provide a very low latency response, and then stream the data to the user as our high-latency operations like disk reads take place.
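
To make that concrete, here is a minimal sketch of the pattern: a WSGI application that returns a generator, so chunks of a large file are read and sent one at a time instead of being buffered in RAM. This is only an illustration of the technique, not the actual pydap modifications, and the file path is hypothetical.

# Minimal sketch of generator-based streaming in a WSGI app.
# This illustrates the technique only -- it is not PCIC's modified pydap.
# The file path below is hypothetical.

def stream_file(path, chunk_size=64 * 1024):
    """Yield a large file in small chunks instead of reading it all at once."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    # Returning a generator lets the server begin sending bytes immediately,
    # while disk reads happen lazily as the client consumes the response.
    return stream_file("/data/downscaled_scenario.nc")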

None of the OPeNDAP servers out there supported streaming, so our development team has made substantial additions to the pydap OPeNDAP server to utilize generators for serving large datasets. We hope to open source these changes later in the year. All of the code will be available soon enough, so I won't go into it, but know that generators were the key enabling technology.

Back to the demo

With the generators discussion behind us, let's turn back to our demo. We're probably done downloading the data, and it looks like we are.

Our engineer has now quickly downloaded gigabytes of custom-selected climate scenario output with just a few clicks. The data is fully attributed with metadata: with units, and with references and citations to the methods used to perform the downscaling. In the spirit of open science, having all of the metadata directly attached to the data is actually a pretty big deal, because it ensures data provenance is trackable even if further operations are performed on the data later on, which is highly likely. From here, it's relatively easy for her to plug the numbers into whatever impacts model she wants to run.
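
As a hedged illustration of what "fully attributed" means in practice, here is a small sketch that lists the attributes embedded in a downloaded NetCDF-4/HDF5 file using h5py (one of the libraries in our stack). The file name, variable name, and specific attributes shown are hypothetical, and any other NetCDF reader would work just as well.

# Minimal sketch: inspect the metadata attached to a downloaded file.
# File name, variable name, and attributes shown are hypothetical;
# h5py works here because NetCDF-4 files are HDF5 underneath.
import h5py

with h5py.File("downloaded_scenario.nc", "r") as f:
    # Global attributes: model, emissions scenario, downscaling method, etc.
    for key, value in f.attrs.items():
        print(key, ":", value)

    # Per-variable attributes such as units and long_name.
    pr = f["pr"]
    print(pr.attrs.get("units"))
    print(pr.attrs.get("long_name"))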

Conclusions

Governments use climate model output to proactively plan adaptation strategies

Climate model output is pretty big

Don't make the user wait for their data

Generators are awesome

Our work should be available in PyDAP later this year

With that, I'll leave you with my simple conclusions. Governments use our downscaled climate model output to plan for the effects of climate change on their infrastructure. There's so much model output that data delivery is a non-trivial problem. We've tried to make it as easy as possible for our users to narrow the data down to what they actually need, and we stream it to them right away. And hopefully later this year, we'll be able to open source our work and make it available to the community.

Acknowledgments and Questions

Thanks to:

My small dev team

PCIC's consortium members

Roberto De Almeida (for PyDAP)

You! (OSS developers and supporters)

Finally, I'll close by thanking all of you FOSS developers out there, because without this community, none of this would have been at all possible. I'll be happy to answer any questions at this time.