View
217
Download
3
Tags:
Embed Size (px)
Citation preview
Berkeley Water Center
Early Experience Prototyping a Science Data Server for Environmental Data
Deb Agarwal, LBL ([email protected])
Catharine van Ingen, MSFT ([email protected])
20 September 2006
Berkeley Water Center
Outline
Landscape Data archives and other sources Typical small group collaboration needs Examples using “Ameriflux”
Science Data Server Goals and ideal capabilities Approach Experiences with the current system
Next generation Next set of development efforts Research issues
Conclusion
Berkeley Water Center 6
Ameriflux Collaboration Overview
149 Sites across the AmericasEach site reports a minimum
of 22 common measurements.Communal science – each
principle investigator acts independently to prepare and publish data.
Data published to and archived at Oak Ridge.
Total data reported to date on the order of 150M half-hourly measurements.
http://public.ornl.gov/ameriflux/
Berkeley Water Center
microbesrootswoodleafnetstoragec RRRRPNEEFF 1. Applications of eddy covariance measurements, Part 1: Lecture on Analyzing and
Interpreting CO2 Flux Measurements, Dennis Baldocchi, CarboEurope Summer Course, 2006, Namur, Belgium (http://nature.berkeley.edu/biometlab/lectures/)
What A Tower Sees
Berkeley Water Center
Example Carbon-Climate Investigations
Net carbon exchange for the ecosystem Impact of climate change on the greening of
ecosystems Start of leaf growth Duration of photosynthesis
Effects of early spring on carbon uptake Role of ecosystem and latitude on carbon flux Effect of various pollution sources on carbon in
atmosphere and carbon balance
Berkeley Water Center
Measurements Are Not Simple or Complete
Gaps in the data Quiet nights Bird poop High winds ….
Difficult to make measurements Leaf area index Wood respiration Soil respiration …
Localized measurements – tower footprint Local investigator knowledge important PIs’ science goals are not uniform across the towers
Soils
Climate
Remote SensingExamples of Carbon-Climate Datasets
Observatory datasets
Spatially continuous datasets
Berkeley Water Center
Scientific Data Server - Goals
Act as a local repository for data and metadata assembled by a small group of scientists from a wide variety of sources Simplify provenance by providing a common “safe deposit box”
for assembled data Interact simply with existing and emerging Internet portals for data
and metadata download, and, over time, upload Simplify data assembly by adding automation Simplify name space confusion by adding explicit decode
translation Support basic analyses across the entire dataset for both data
cleaning and science Simplify mundane data handling tasks Simplify quality checking and data selection by enabling data
browsing
Berkeley Water Center
Scientific Data Server - Non-Goals
Replace the large Internet data source sites The technology developed may be applicable, but the focus is
on the group collaboration scale and usability Very large datasets require different operational practices
Perform complex modeling and statistical analyses There are a lot of existing tools with established trust based on
long track records Only part of a full LIMS (laboratory information management
system) Develop a new standard schema or controlled vocabulary
Other work on these is progressing independently Due to the heterogeneity of the data, more than one such
standard seems likely to be relevant
Berkeley Water Center
Scientific Data Server - Workflows
Staging: adding data or metadata New downloaded or field measurements
added New derived measurements added
Editing: changing data or metadata Existing older measurements re-calibrated
or re-derived Data cleaning or other algorithm changes Gap filling
Sharing: making the latest acquired data available rapidly Even before all the checks have been
made Browsing new data before more detailed
analyses Private Analysis: Supporting individual
researchers (MyDB) Stable location for personal calibrations,
derivations, and other data transformations
Import/Export to analysis tools and models Curating: data versioning and provenance
Simple parent:child versioning to track collections of data used for specific uses
Large Data Archives
Local measurements
Berkeley Water Center
Scientific Data Server - Logical Overview
DataAccess
and Analysis
Tools
Latest DatasetDatabase
Latest DatasetCube
Staging Databases
and Cubes
Last Known Good Dataset(s)
Database
Older Dataset(s)Archive
Database
Last Known Good Dataset(s) Cubes
Private Data
Analysis Databases
and Cubes
Scientific Data Server
Analysis ToolsExcel, Matlab, SPlus, SAS,
ArcGIS
Simple web data plots and
tables
BigPlot data browsing
Computational Models
Flat file data import/export
Berkeley Water Center
Databases
All descriptive metadata and data held in relational databases Metadata is important too!
While separate databases are shown, the datasets may actually reside in a single database Mapping is transparent to the
scientist Separate databases used for
performance Unified databases used for
simplicity New metadata and data are staged with
a temporary database Minimal quality checks applied All name and unit conversions
Data may be exported to flat file, copied to a private MyDb database, directly accessed programmatically, or ?
Latest DatasetDatabase
Last Known Good Dataset(s)
Database
Older Dataset(s)Archive
Database
MyDbAnalysis
Database
Staging Database
Berkeley Water Center
Data Cubes
A data cube is a database specifically for data mining (OLAP) Initially developed for commercial
needs like tracking sales of Oreos and milk
Simple aggregations (sum, min, or max) can be pre-computed for speed
Additional calculations (median) can be computed dynamically
Both operate along dimensions such as time, site, or datumtype
Constructed from a relational database
A specialized query language (MDX) is used
Client tool integrations is evolving Excel PivotTables allow simple data
viewing More powerful charting with
Tableaux or ProClarity (commercial mining tools)
Latest DatasetCube
Last Known Good Dataset(s) Cubes
Staging Data Cube
MyDbAnalysis
Data Cubes
Berkeley Water Center
Browsing For Data AvailabilitySites Reporting Data Colored by Year
Ameriflux Data Availability : All Data
Bra
zil -
- T
apaj
os (
San
tare
m,K
mB
razi
l --
Tap
ajos
(S
anta
rem
,Km
Can
ada
- B
orea
s 18
50C
anad
a --
BO
RE
AS
NS
A -
193
0 bu
Can
ada
-- B
OR
EA
S N
SA
- 1
963
buC
anad
a --
BO
RE
AS
NS
A -
198
1 bu
Can
ada
-- B
OR
EA
S N
SA
- 1
989
buC
anad
a --
BO
RE
AS
NS
A -
199
8 bu
Can
ada
-- B
OR
EA
S N
SA
- O
ld B
laC
anad
a --
Brit
ish
Col
., C
ampb
eC
anad
a --
Let
hbrid
geU
SA
--
AK
Atq
asuk
, A
lask
aU
SA
--
AK
Bar
row
, A
lask
aU
SA
--
AK
Hap
py V
alle
y, A
lask
aU
SA
--
AK
Upa
d, A
lask
aU
SA
--
AZ
Aud
ubon
Res
earc
h R
anU
SA
--
CA
Blo
dget
t F
ores
t, C
alU
SA
--
CA
Sky
Oak
s, O
ld S
tand
,U
SA
--
CA
Sky
Oak
s, Y
oung
Sta
nU
SA
--
CA
Ton
zi R
anch
, C
alifo
rU
SA
--
CA
Vai
ra R
anch
, Io
ne,
CU
SA
--
CO
Niw
ot R
idge
For
est,
U
SA
--
CT
Gre
at M
ount
ain
For
esU
SA
--
FL
Flo
rida-
Ken
nedy
Spa
cU
SA
--
FL
Flo
rida-
Ken
nedy
Spa
cU
SA
--
FL
Sla
shpi
ne-A
ustin
Car
US
A -
- F
L S
lash
pine
-Don
alds
on,
US
A -
- F
L S
lash
pine
-Miz
e,cl
ear
US
A -
- F
L S
lash
pine
-Ray
onie
r,m
US
A -
- IL
Bon
dvill
e, I
llino
isU
SA
--
IN M
orga
n M
onro
e S
tate
U
SA
--
KS
Wal
nut
Riv
er W
ater
shU
SA
--
MA
Har
vard
For
est
EM
S T
US
A -
- M
A H
arva
rd F
ores
t he
mlo
US
A -
- M
A L
ittle
Pro
spec
t H
illU
SA
--
ME
How
land
For
est
(mai
nU
SA
--
MI
Syl
vani
a W
ilder
ness
U
SA
--
MI
Uni
v. o
f M
ich.
Bio
loU
SA
--
MO
Mis
sour
i Oza
rk S
iteU
SA
--
MS
Goo
dwin
Cre
ek,
Mis
siU
SA
--
MT
For
t P
eck,
Mon
tana
US
A -
- N
C D
uke
For
est
- lo
blol
US
A -
- N
C D
uke
For
est-
hard
woo
dU
SA
--
NE
Mea
d -
irrig
ated
con
US
A -
- N
E M
ead
- irr
igat
ed m
aiU
SA
--
NE
Mea
d -
rain
fed
mai
zeU
SA
--
OK
Litt
le W
ashi
ta W
ater
US
A -
- O
K P
onca
City
, O
klah
oma
US
A -
- O
K S
hidl
er,
Okl
ahom
aU
SA
--
OK
Sou
ther
n G
reat
Pla
inU
SA
--
OR
Met
oliu
s-fir
st y
oung
US
A -
- O
R M
etol
ius-
inte
rmed
iat
US
A -
- O
R M
etol
ius-
old
aged
po
US
A -
- S
D B
lack
Hill
s, S
outh
DU
SA
--
SD
Bro
okin
gs,
Sou
th D
akU
SA
--
TN
Wal
ker
Bra
nch
Wat
ers
US
A -
- W
A W
ind
Riv
er C
rane
Sit
US
A -
- W
I Lo
st C
reek
, W
isco
nsi
US
A -
- W
I P
ark
Fal
ls/W
LEF
, W
isU
SA
--
WI
Will
ow C
reek
, W
isco
nU
SA
--
WV
Can
aan
Val
ley,
Wes
t
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
1993
1992
1991
Berkeley Water Center
Browsing For Data Availability Total Data Availability by Type Colored by Site
0
2,000,000
4,000,000
6,000,000
8,000,000
10,000,000
12,000,000
14,000,000
16,000,000
18,000,000
AP
AR
CO
2
DT
FC
FG
FH
2O
FP
AR
GP
P H
H2
O LE
Le
afw
etn
ess
NE
E
O3
Oth
er
PA
R
PR
EC
PR
ES
S
Rd
Rg
Rg
l
RH
Rn
Sa
SC
O2
SV
P
SW
C
TA
TA
U
Tb
ole
Td
ew
TS U
US
T
UW
VP
D
WD
WS
USA -- WV Canaan Valley, West USA -- WI Willow Creek, WisconUSA -- WI Park Falls/WLEF, WisUSA -- WI Lost Creek, WisconsiUSA -- WA Wind River Crane SitUSA -- TN Walker Branch WatersUSA -- SD Brookings, South DakUSA -- SD Black Hills, South DUSA -- OR Metolius-old aged poUSA -- OR Metolius-intermediatUSA -- OR Metolius-f irst youngUSA -- OK Southern Great PlainUSA -- OK Shidler, OklahomaUSA -- OK Ponca City, OklahomaUSA -- OK Little Washita WaterUSA -- NE Mead - rainfed maizeUSA -- NE Mead - irrigated maiUSA -- NE Mead - irrigated conUSA -- NC Duke Forest-hardw oodUSA -- NC Duke Forest - loblolUSA -- MT Fort Peck, MontanaUSA -- MS Goodw in Creek, MissiUSA -- MO Missouri Ozark SiteUSA -- MI Univ. of Mich. BioloUSA -- MI Sylvania Wilderness USA -- ME How land Forest (mainUSA -- MA Little Prospect HillUSA -- MA Harvard Forest hemloUSA -- MA Harvard Forest EMS TUSA -- KS Walnut River WatershUSA -- IN Morgan Monroe State USA -- IL Bondville, IllinoisUSA -- FL Slashpine-Rayonier,mUSA -- FL Slashpine-Mize,clearUSA -- FL Slashpine-Donaldson,USA -- FL Slashpine-Austin CarUSA -- FL Florida-Kennedy SpacUSA -- FL Florida-Kennedy SpacUSA -- CT Great Mountain ForesUSA -- CO Niw ot Ridge Forest, USA -- CA Vaira Ranch, Ione, CUSA -- CA Tonzi Ranch, CaliforUSA -- CA Sky Oaks, Young StanUSA -- CA Sky Oaks, Old Stand,USA -- CA Blodgett Forest, CalUSA -- AZ Audubon Research RanUSA -- AK Upad, AlaskaUSA -- AK Happy Valley, AlaskaUSA -- AK Barrow , AlaskaUSA -- AK Atqasuk, AlaskaCanada -- LethbridgeCanada -- British Col., CampbeCanada -- BOREAS NSA - Old BlaCanada -- BOREAS NSA - 1998 buCanada -- BOREAS NSA - 1989 buCanada -- BOREAS NSA - 1981 buCanada -- BOREAS NSA - 1963 buCanada -- BOREAS NSA - 1930 buCanada - Boreas 1850Brazil -- Tapajos (Santarem,KmBrazil -- Tapajos (Santarem,Km
Data type reporting is far from uniform across type
Berkeley Water Center
Browsing for Data Availability Total Data Availability by Site Colored by Type
0
1,000,000
2,000,000
3,000,000
4,000,000
5,000,000
6,000,000
7,000,000
Bra
zil -
- T
apaj
os (
San
tare
m,K
m
Bra
zil -
- T
apaj
os (
San
tare
m,K
m
Can
ada
- B
orea
s 18
50
Can
ada
-- B
OR
EA
S N
SA
- 1
930
bu
Can
ada
-- B
OR
EA
S N
SA
- 1
963
bu
Can
ada
-- B
OR
EA
S N
SA
- 1
981
bu
Can
ada
-- B
OR
EA
S N
SA
- 1
989
bu
Can
ada
-- B
OR
EA
S N
SA
- 1
998
bu
Can
ada
-- B
OR
EA
S N
SA
- O
ld B
la
Can
ada
-- B
ritis
h C
ol.,
Cam
pbe
Can
ada
-- L
ethb
ridge
US
A -
- A
K A
tqas
uk, A
lask
a
US
A -
- A
K B
arro
w, A
lask
a
US
A -
- A
K H
appy
Val
ley,
Ala
ska
US
A -
- A
K U
pad,
Ala
ska
US
A -
- A
Z A
udub
on R
esea
rch
Ran
US
A -
- C
A B
lodg
ett F
ores
t, C
al
US
A -
- C
A S
ky O
aks,
Old
Sta
nd,
US
A -
- C
A S
ky O
aks,
You
ng S
tan
US
A -
- C
A T
onzi
Ran
ch, C
alifo
r
US
A -
- C
A V
aira
Ran
ch, I
one,
C
US
A -
- C
O N
iwot
Rid
ge F
ores
t,
US
A -
- C
T G
reat
Mou
ntai
n F
ores
US
A -
- F
L F
lorid
a-K
enne
dy S
pac
US
A -
- F
L F
lorid
a-K
enne
dy S
pac
US
A -
- F
L S
lash
pine
-Aus
tin C
ar
US
A -
- F
L S
lash
pine
-Don
alds
on,
US
A -
- F
L S
lash
pine
-Miz
e,cl
ear
US
A -
- F
L S
lash
pine
-Ray
onie
r,m
US
A -
- IL
Bon
dville
, Illin
ois
US
A -
- IN
Mor
gan
Mon
roe
Sta
te
US
A -
- K
S W
alnu
t Riv
er W
ater
sh
US
A -
- M
A H
arva
rd F
ores
t EM
S T
US
A -
- M
A H
arva
rd F
ores
t hem
lo
US
A -
- M
A L
ittle
Pro
spec
t Hill
US
A -
- M
E H
owla
nd F
ores
t (m
ain
US
A -
- M
I Syl
vani
a W
ilder
ness
US
A -
- M
I Uni
v. o
f Mic
h. B
iolo
US
A -
- M
O M
isso
uri O
zark
Site
US
A -
- M
S G
oodw
in C
reek
, Mis
si
US
A -
- M
T F
ort P
eck,
Mon
tana
US
A -
- N
C D
uke
For
est -
lobl
ol
US
A -
- N
C D
uke
For
est-
hard
woo
d
US
A -
- N
E M
ead
- irr
igat
ed c
on
US
A -
- N
E M
ead
- irr
igat
ed m
ai
US
A -
- N
E M
ead
- ra
infe
d m
aize
US
A -
- O
K L
ittle
Was
hita
Wat
er
US
A -
- O
K P
onca
City
, Okl
ahom
a
US
A -
- O
K S
hidl
er, O
klah
oma
US
A -
- O
K S
outh
ern
Gre
at P
lain
US
A -
- O
R M
etol
ius-
first
you
ng
US
A -
- O
R M
etol
ius-
inte
rmed
iat
US
A -
- O
R M
etol
ius-
old
aged
po
US
A -
- S
D B
lack
Hills
, Sou
th D
US
A -
- S
D B
rook
ings
, Sou
th D
ak
US
A -
- T
N W
alke
r B
ranc
h W
ater
s
US
A -
- W
A W
ind
Riv
er C
rane
Sit
US
A -
- W
I Los
t Cre
ek, W
isco
nsi
US
A -
- W
I Par
k F
alls
/WLE
F, W
is
US
A -
- W
I Willo
w C
reek
, Wis
con
US
A -
- W
V C
anaa
n V
alle
y, W
est
APAR CO2 DT FC FG FH2O FPAR GPP H H2OLE Leafwetness NEE O3 Other PAR PREC PRESS Rd RgRgl RH Rn Sa SCO2 SVP SWC TA TAU TboleTdew TS U UST UW VPD WD WS
Sites report more data either because of longevity or specific research interests
Berkeley Water Center
Browsing for Data Quality
Real field data has both short term gaps and longer term outages The utility of the data depends
on the nature of the science being performed
Browsing data counts can give rapid insight into how the data can be used before more complex analyses are performed
0
1000
2000
3000
4000
5000
6000
1 2 3 4 5 6 7 8 9 10 11 12
55.86306 BOREAS NSA -1981 burn site
55.879002 BOREAS NSA -Old Black Spruce
55.90583 BOREAS NSA -1930 burn site
55.911671 BOREAS NSA -1963 burn site
55.916672 BOREAS NSA -1989 burn site
56.63583 BOREAS NSA -1998 burn site
69.133331 AK HappyValley
70.281471 AK Upad
70.496002 AK Atqasuk
Data often missing in the winter!
Soil Water Content (% Volume)
-10
-5
0
5
10
15
20
25
30
35
40
45
0 10 20 30 40 50 60
Tonzi Ranch
Vai
ra R
anch
0 (cm)
20 (cm)Measurements charted on axes are gaps
-15
-10
-5
0
5
10
15
20
25
30
20 30 40 50 60 70 80
Latitude
Deg
C
Average Temperature
What’s going on at higher latitudes? (It should be getting colder)
Data Count
Berkeley Water Center
Browsing for Data Quality
Real field data has unit and time scale conversion problems Sometimes easy to spot in
isolation Sometimes easier to spot when
comparing to other data Browsing data values can give
rapid insight into how the data can be used before more complex analyses are performed
Maximum Annual Air TemperatureGlobal Warming or Reporting in Fahrenheit?
Odd Microclimate Effects or Error in Time Reporting ?
Average Air Temperature Two Nearby Sites
Local time or GMT time?
Berkeley Water Center
Lessons Learned To Date
Metadata is as important as data Comparing sites of like vegetation, climate is as important
as latitude or other physical quantity Curate the two together
Controlled vocabularies are hard Humans like making up names and have a hard time
remembering 100+ names Assume a decode step in the staging pipeline
There are at least three database schema families and two cube construction approaches Everyone has a favorite Each has advantages and disadvantages Automate the maintenance and use the right one for the
right job Visual programming tools are great for prototyping
But debugging and maintenance can hit a wall It’s easy to overbuild – use when “good enough”
Data analysis and data cleaning are intertwined Data cleaning is always on-going Share the simple tools and visualizations
The saga continues at http://dsd.lbl.gov/BWC/amfluxblog/ and http://research.microsoft.com/~vaningen/BWC/BWC.htm
Berkeley Water Center
Near Term Futures
Improve current capabilities Assemble gap-filled and non-gap filled data sets Implement incremental data staging to enable speedy and
simple data editing by an actual scientist (rather than a programmer)
Implement expanded metadata handling to enable scientist to add site characteristics and sort sites on those expanded definitions
Add basic reporting capabilities for server-side browsing of data availability to speed and simplify locating “interesting” data
Apply Data Server capabilities to a different set of data with different (but related) science Considering either Russian River or Yosemite Valley
hydrological data Will be automating download from multiple different
national data sets Spatial (GIS) analyses more important Linkage with imagery data necessary for science
Berkeley Water Center
Longer Term Futures
Handling imagery and other remote sensing information Curating images is different from curating time series data Using both together enables new science and new insights Graphical selection and display of data
Support for user specified calculations within the database Support for direct connections to analysis and statistical
packages Linkage with models
Additional (emerging) data standards such as NetCDF Handling “just in time” data delivery and model result
curation Data mining subscription services Handling of a broader array of data types Support for workflow tools
Berkeley Water Center
Conclusions
Large data archives create the opportunity to Do science at the regional and global scale Combine data from multiple disciplines Perform historical trend analysis
Small scientific collaborations need help to Perform analyses using more data than they can
currently manage Enable data handling and versioning Store the currently needed data and metadata Browse the data for science
It’s the science, not the computer science Computer science research can certainly help