May 29, 2007
Dynamically Adaptive Weather Analysis and Forecasting in LEAD: Issues in Data Management, Metadata, and Search
Beth Plale
Director, Center for Data and Search Informatics
School of Informatics, Indiana University, US
Introduction
• Linked Environments for Atmospheric Discovery (LEAD) makes meteorological data, forecast models, and analysis and visualization tools available to anyone who wants to interactively explore the weather as it evolves.
• In this talk we describe key data management aspects of the project, specifically the work being carried out in the Center for Data and Search Informatics at Indiana University.
Infrastructure is portal based - that is, all services are available
through a web server
e-Science Gateway Architecture [1]
Figure layers, top to bottom:
• User’s Grid Desktop
• Grid Portal Server
• Gateway Services: Proxy Certificate Server (Vault), Events & Messaging, Resource Broker, Community & User Metadata Catalog, Workflow Engine, Resource Registry, Application Deployment
• Core Grid Services: Execution Management, Information Services, Self Management, Data Services, Resource Management, Security Services
• Resource Virtualization (OGSA)
• Compute Resources, Data Resources, Instruments & Sensors
[1] Service Oriented Architectures for Science Gateways on Grid Systems, Gannon, D., et al.; ICSOC, 2005
Typical weather forecast runs as workflow
~400 data products consumed and produced (transformed) during the workflow lifecycle
Workflow stages: Pre-Processing, Assimilation, Forecast, Visualization
Input data: terrain data files; surface data files; ETA, RUC, GFS data; radar data (level II); radar data (level III); satellite data; surface, upper air mesonet & wind profiler data
Workflow tasks: arpssfc, arpstrn, Ext2arps-ibc, 88d2arps, mci2arps, ADAS assimilation, arps2wrf, nids2arps, WRF, Ext2arps-lbc, wrf2arps, arpsplot, IDV viz
To set up a workflow experiment, we select a workflow (not shown), then set model parameters here.
Data Integration
Data sources:
• CASA radar collection, months (ftp)
• Unidata IDD distribution, latest 3 days (XML web server)
• Level II and III radar, latest 3 days (XML web server)
• ETA, NCEP, NAM, METAR, etc. (XML web server)
Sites: Oklahoma, Indiana, Colorado
Local view: crosswalk point of presence supports crawling and publishes a difference list as LEAD Metadata Schema (LMS) documents
Globally integrated view: Data Catalog Service
• Crawler crawls catalogs
• Builds index of results
• Web service API
• Boolean search query with spatial/temporal support
Index: XMLDB native XML database and Lucene for index
Web service API: accepts a Boolean search query and returns a list of results as LEAD Metadata Schema documents
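A Boolean search with spatial and temporal filters, as listed for the Data Catalog Service, might look roughly like the sketch below. It is an in-memory illustration only: the `CatalogEntry` fields, the `search` function, and the sample records are all assumptions made for the example, not the LEAD service API or the LMS schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CatalogEntry:
    """One crawled record; every field name here is hypothetical."""
    name: str
    keywords: frozenset
    lat: float
    lon: float
    time: datetime


def search(index, all_of=(), any_of=(), none_of=(), bbox=None, time_range=None):
    """Boolean keyword query (AND / OR / NOT) with optional filters.

    bbox is (south, west, north, east); time_range is (start, end), inclusive.
    """
    results = []
    for entry in index:
        if not all(k in entry.keywords for k in all_of):
            continue
        if any_of and not any(k in entry.keywords for k in any_of):
            continue
        if any(k in entry.keywords for k in none_of):
            continue
        if bbox is not None:
            south, west, north, east = bbox
            if not (south <= entry.lat <= north and west <= entry.lon <= east):
                continue
        if time_range is not None:
            start, end = time_range
            if not (start <= entry.time <= end):
                continue
        results.append(entry)
    return results


# Made-up sample records, for illustration only.
index = [
    CatalogEntry("radar-ok", frozenset({"radar", "level2"}), 35.2, -97.4, datetime(2007, 5, 1, 12)),
    CatalogEntry("radar-in", frozenset({"radar", "level3"}), 39.2, -86.5, datetime(2007, 5, 2, 12)),
    CatalogEntry("metar-co", frozenset({"metar", "surface"}), 40.0, -105.3, datetime(2007, 5, 2, 18)),
]

hits = search(
    index,
    all_of=["radar"],
    none_of=["level3"],
    bbox=(30.0, -100.0, 40.0, -90.0),
    time_range=(datetime(2007, 5, 1), datetime(2007, 5, 3)),
)
print([e.name for e in hits])
```

A real catalog would answer such queries from an index rather than a linear scan; the scan is used here only to keep the Boolean logic visible.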
LEAD Personal Workspace
• Cyberinfrastructure (CI) extends the user’s desktop to incorporate a vast data analysis space.
• As users go about doing scientific experiments, the CI manages back-end storage and compute resources.
• The portal provides ways to explore, search, and discover this data.
• Metadata about experiments is largely generated automatically and is highly searchable.
• The metadata describes each data object (the file) in application-rich terms, and provides a URI to a data service that can resolve an abstract unique identifier to a real, on-line data “file”.
Searching for experiments based on model parameters: 4 returned experiments; one displayed
How forecast model configuration parameters are stored in the personal catalog
The forecast model configuration file is handed off to a plugin that shreds the XML document into queryable attributes associated with the experiment.
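The shredding step can be pictured with a short sketch. It flattens an XML configuration document into path/value attributes and then finds experiments by parameter value. The element names, the `shred`, `register`, and `find_experiments` helpers, and the in-memory `catalog` are assumptions for illustration; the actual plugin and catalog schema are not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration document; element names are illustrative only.
CONFIG_XML = """
<forecastConfig>
  <grid>
    <nx>103</nx>
    <dx>4000</dx>
  </grid>
  <physics>
    <microphysics>lin</microphysics>
  </physics>
</forecastConfig>
"""


def shred(xml_text):
    """Flatten every leaf element into a {path: value} attribute."""
    root = ET.fromstring(xml_text)
    attributes = {}

    def walk(element, path):
        children = list(element)
        if not children:
            attributes[path] = (element.text or "").strip()
        for child in children:
            walk(child, f"{path}/{child.tag}")

    walk(root, root.tag)
    return attributes


catalog = {}  # experiment id -> shredded attributes (stand-in for the personal catalog)


def register(experiment_id, xml_text):
    """Associate the shredded attributes with an experiment."""
    catalog[experiment_id] = shred(xml_text)


def find_experiments(criteria):
    """Return ids of experiments whose attributes match every path/value pair."""
    return [
        exp_id
        for exp_id, attrs in catalog.items()
        if all(attrs.get(path) == value for path, value in criteria.items())
    ]


register("exp-1", CONFIG_XML)
register("exp-2", CONFIG_XML.replace("<dx>4000</dx>", "<dx>2000</dx>"))
print(find_experiments({"forecastConfig/grid/dx": "2000"}))
```

Storing one attribute per leaf element is what makes a search on model parameters a simple attribute match instead of a scan of whole documents.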
What & Why of Provenance
• Derivation history of a data product
  • What (when, where) application created the data
  • Its parameters & configuration
  • Other input data used by application
• Workflow is composed from building blocks like these, so provenance for data used in a workflow gives a workflow trace.
Building block (figure): Application A takes Data.In.1, Data.In.2, and Config.A and produces Data.Out.1
Data Provenance::Data.Out.1
Process: Application_A
Timestamp: 2006-06-23T12:45:23
Host: tyr20.cs.indiana.edu
…
Input: Data.In.1, Data.In.2
Config: Config.A
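To see how per-product provenance yields a workflow trace, here is a minimal sketch. The `ProvenanceRecord` class and `workflow_trace` function are hypothetical names, and the second record (Application_B, Data.Raw.1, and its timestamp) is invented purely so the example has an upstream step to walk to; only the Data.Out.1 record comes from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Derivation history of one data product (field names are assumptions)."""
    product: str
    process: str
    timestamp: str
    host: str
    inputs: list = field(default_factory=list)
    config: str = ""


def workflow_trace(product, records):
    """Return the processes that contributed to `product`, upstream first."""
    trace = []
    seen = set()

    def visit(name):
        record = records.get(name)
        if record is None or name in seen:
            return  # raw input with no recorded producer, or already visited
        seen.add(name)
        for source in record.inputs:
            visit(source)
        trace.append(record.process)

    visit(product)
    return trace


records = {
    # Record shown on the slide.
    "Data.Out.1": ProvenanceRecord(
        product="Data.Out.1",
        process="Application_A",
        timestamp="2006-06-23T12:45:23",
        host="tyr20.cs.indiana.edu",
        inputs=["Data.In.1", "Data.In.2"],
        config="Config.A",
    ),
    # Invented upstream record, for illustration only.
    "Data.In.2": ProvenanceRecord(
        product="Data.In.2",
        process="Application_B",
        timestamp="2006-06-23T12:40:00",
        host="tyr20.cs.indiana.edu",
        inputs=["Data.Raw.1"],
    ),
}

print(workflow_trace("Data.Out.1", records))
```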
The What & Why of Provenance
• Trace Workflow Execution
  • What services were used during workflow execution?
  • Were all steps of execution successful?
• Audit Trail
  • What resources were used during workflow execution?
• Data Quality & Reuse
  • What applications were used to derive data products?
  • Which workflows use a certain data product?
• Attribution
  • Who performed the experiment?
  • Who owns the workflow & data products?
• Discovery
  • Locate data generated by a workflow
  • Locate workflows containing App-X that succeeded
Karma Provenance Service
Figure components:
• Workflow instance: Workflow Engine plus Service 1 through Service 10; 10 data products consumed (10C) and produced (10P) by each service
• Provenance activities are published as notifications: Workflow-Started & -Finished activities; Application-Started & -Finished and Data-Produced & -Consumed activities
• Message bus: WS-Messenger Notification Broker (WS-Eventing service API)
• Karma Provenance Service: Provenance Listener (subscribes & listens to activity notifications), Activity DB, Provenance Query API
• Provenance Browser Client: queries for workflow, process, & data provenance
• Collection Framework
A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al., ICWS Conference, 2006
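The collection pattern in the figure (publish activities to a broker, a listener stores them, a query API reads them back) can be sketched in a few lines. This is an in-memory stand-in under stated assumptions: `NotificationBroker` and `ProvenanceStore` are hypothetical classes, the activity dictionaries use made-up keys, and nothing here reproduces the WS-Messenger or Karma interfaces.

```python
from collections import defaultdict


class NotificationBroker:
    """In-memory stand-in for a publish-subscribe broker (not WS-Messenger)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, activity):
        for callback in self._subscribers[topic]:
            callback(activity)


class ProvenanceStore:
    """Listens on one topic, keeps activities in a list, answers simple queries."""

    def __init__(self, broker, topic):
        self.activities = []  # plays the role of the activity database
        broker.subscribe(topic, self.activities.append)

    def data_provenance(self, product):
        """Return the producing service and the data that service consumed."""
        producer = next(
            (a["service"] for a in self.activities
             if a["type"] == "DataProduced" and a["data"] == product),
            None,
        )
        inputs = [
            a["data"] for a in self.activities
            if a["type"] == "DataConsumed" and a["service"] == producer
        ]
        return {"producer": producer, "inputs": inputs}


broker = NotificationBroker()
store = ProvenanceStore(broker, topic="workflow-42")  # topic name is made up

broker.publish("workflow-42", {"type": "ApplicationStarted", "service": "Service1"})
broker.publish("workflow-42", {"type": "DataConsumed", "service": "Service1", "data": "File1"})
broker.publish("workflow-42", {"type": "DataProduced", "service": "Service1", "data": "File2"})
broker.publish("workflow-42", {"type": "ApplicationFinished", "service": "Service1"})

print(store.data_provenance("File2"))
```

Because services only publish and the store only subscribes, neither side needs to know about the other, which is the decoupling the message bus provides in the figure.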
Generating Karma Provenance Activities
• Instrument applications to publish provenance
• Simple Java library available to
  • Create provenance activities
  • Publish activities as messages
• Jython “wrapper” scripts use the library to publish provenance & invoke the application
• Generic Factory toolkit easily converts applications to web services
  • Built-in provenance instrumentation
Sample Sequence of Activities
appStarted(App1)
info(‘App1 starting’)
fileReceiveStarted(File1)
-- do gridftp get to stage input file File1 --
fileReceiveFinished(File1)
fileConsumed(File1)
computationStarted(Code1)
-- call Fortran code Code1 to process input files --
computationFinished(Code1)
fileProduced(File2)
fileSendStarted(File2)
-- do gridftp put to save output file File2 --
fileSendFinished(File2)
publishURL(File2)
appFinishedSuccess(App1, File2) | appFinishedFailed(App1, ERR)
flush()
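A wrapper script following this sequence could be structured roughly as below. The sketch is written in plain Python rather than Jython, the `Notifier` class is a hypothetical stand-in for the provenance library (it only records what would be published), and the GridFTP staging steps are left out. Activity names are taken from the sequence above; everything else is an assumption.

```python
import subprocess
import sys


class Notifier:
    """Hypothetical stand-in for the provenance library; records activities in order."""

    def __init__(self):
        self.sent = []

    def publish(self, activity, *args):
        self.sent.append((activity, args))


def run_wrapped(notifier, app, code, command, input_file, output_file):
    """Publish provenance activities around one application invocation.

    File staging (the gridftp get/put steps) is omitted from this sketch.
    """
    notifier.publish("appStarted", app)
    notifier.publish("fileConsumed", input_file)
    notifier.publish("computationStarted", code)
    try:
        subprocess.run(command, check=True)
    except (subprocess.CalledProcessError, OSError) as err:
        notifier.publish("appFinishedFailed", app, str(err))
        return False
    notifier.publish("computationFinished", code)
    notifier.publish("fileProduced", output_file)
    notifier.publish("appFinishedSuccess", app, output_file)
    return True


# A trivial child process stands in for the Fortran code.
ok_notifier = Notifier()
ok = run_wrapped(ok_notifier, "App1", "Code1", [sys.executable, "-c", "pass"], "File1", "File2")

# A child process that exits non-zero exercises the failure branch.
bad_notifier = Notifier()
bad = run_wrapped(
    bad_notifier, "App1", "Code1",
    [sys.executable, "-c", "import sys; sys.exit(3)"], "File1", "File2",
)

print([name for name, _ in ok_notifier.sent])
print([name for name, _ in bad_notifier.sent])
```

Exactly one of appFinishedSuccess or appFinishedFailed is published per run, matching the alternative shown in the last activity of the sequence.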
Performance perturbation
Figure: cumulative execution time with and without provenance, and provenance overhead per script
• Horizontal axis: Workflow Application Scripts Execution Sequence (Start, Terrain PreProc, Surface PreProc, 3D Interp, ARPS2WRF, WRF, WRF2ARPS, ARPS Plot, PS2Image)
• Left axis: Cumulative Time for Execution (Secs), 0 to 3000
• Right axis: Provenance Overhead for Each Script (Secs), -15 to 15
• Series: Cumulative Time w/o Provenance (Secs), Cumulative Time w/ Provenance (Secs), Provenance Overhead
• Per-script overhead labels in the chart fall between -3 and 6 seconds
Scalability Study [4]
[4] Performance Evaluation of the Karma Provenance Framework for Scientific Workflows, Simmhan, Y., et al.; IPAW Workshop, 2006
Resource adaptation illustrated (1)
Figure components:
• Portal
• Workflow Configuration Service
• LEAD BPEL Workflow Engine: runs the workflow one step at a time
• Event Broker
• Application Service (per task): runs job, receives job notification
• App. Factory: creates services
• Resource Management Services: launch services; sensor and actuator
• Workflow DAG; workflow and file status
• myLEAD: subscribes to messages from the broker, knows what magic to do with input/output files, and talks to RLS/DRS
Scenario: resource changes. A resource has failed, so the remaining parts of the workflow need to be rescheduled:
• Stop the earlier workflow
• Replan the workflow
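The replanning step in this scenario can be illustrated with a small sketch: given a workflow DAG, the set of completed tasks, and a failed resource, it moves the remaining tasks that were placed on that resource onto the resources still available. The `topological_order` and `replan` functions, the resource names, the dependency edges between tasks, and the round-robin policy are all assumptions for the example, not the LEAD scheduler.

```python
def topological_order(dag):
    """Order tasks so that every task appears after the tasks it depends on."""
    order, seen = [], set()

    def visit(task):
        if task in seen:
            return
        seen.add(task)
        for dependency in dag[task]:
            visit(dependency)
        order.append(task)

    for task in dag:
        visit(task)
    return order


def replan(dag, completed, assignment, failed_resource, available):
    """Return a task -> resource plan for the tasks that have not completed.

    Tasks placed on the failed resource are moved round-robin onto `available`;
    other remaining tasks keep their original resource.
    """
    if not available:
        raise ValueError("no resources left to reschedule onto")
    new_plan = {}
    moved = 0
    for task in topological_order(dag):
        if task in completed:
            continue
        resource = assignment[task]
        if resource == failed_resource:
            resource = available[moved % len(available)]
            moved += 1
        new_plan[task] = resource
    return new_plan


# Task names come from the forecast workflow; the edges are illustrative only.
dag = {
    "arpstrn": [],
    "arps2wrf": ["arpstrn"],
    "WRF": ["arps2wrf"],
    "wrf2arps": ["WRF"],
    "arpsplot": ["wrf2arps"],
}
assignment = {
    "arpstrn": "clusterA",
    "arps2wrf": "clusterA",
    "WRF": "clusterA",
    "wrf2arps": "clusterA",
    "arpsplot": "clusterB",
}

plan = replan(
    dag,
    completed={"arpstrn", "arps2wrf"},
    assignment=assignment,
    failed_resource="clusterA",
    available=["clusterB", "clusterC"],
)
print(plan)
```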
Resource adaptation illustrated (2)
Figure: same components as in (1)
Scenario: weather change. Implement Adverse Weather Policy:
• Implement strict deadline scheduling
• Plan resources for sub-components
• Change priorities for users, e.g. Lavanya’s workflow gets lower priority
Resource adaptation illustrated (3)
Figure: same components as in (1)
Scenario: a “Service Overloaded” event leads to a “Replicate Service” action.
Recent LEAD Highlight
Spring 2007 Weather Challenge Forecast contest, February - March 2007
• Students ran …..
Statistics from the Challenge
• Approximately 50 participants
• 6696 jobs submitted to Teragrid (52925 TG SUs)
• Generated about 2.6 TB of data, which is archived at Indiana University and available through each participating user’s personal workspace catalog.
• Computational models run on Teragrid resources. Portal and persistent back-end services run at Indiana University. Data storage resources (45 TB) for user-generated data products are provided by Indiana University.
Future Work
• Optimizations and refinements: file movement, revisiting the metadata schema, and improving crosswalks with an eye to reduced maintenance
• Personal predictor: packaging the LEAD framework into a single 8-16 core multicore machine for individual purchase
Thanks to the whole LEAD team, and the National Science Foundation for their support.
For more information, feel free to contact me at [email protected] or go to http://www.leadportal.org