Pipelines and Scientific Workflows with Ptolemy II
Deana Pennington, University of New Mexico, LTER Network Office
Shawn Bowers, UCSD, San Diego Supercomputer Center
Analytical Pipelines
[Figure: an analytical pipeline (AP0) drawn as a chain of analysis steps (ASx, ASy, ASz, ASr) and transformation steps (TS1, TS2).]
• ASx: analysis step in an execution environment (SAS, MATLAB, etc.)
• TS1: transformation step
• Library of analysis steps and analytical pipelines
• Parameter ontologies and taxonomies (ECO, Taxon)
• Semantic mediation system: logic rules, query processing
• Parameters with semantics
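The chain of analysis and transformation steps on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Ptolemy II code: the `Step` class, the `run_pipeline` function, and the three toy step bodies are all hypothetical, and only the labels ASx, TS1, ASy, and AP0 come from the slide.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Step:
    """One box in the pipeline diagram (hypothetical structure)."""
    name: str
    kind: str  # "analysis" or "transformation"
    run: Callable[[Any], Any]


def run_pipeline(steps: List[Step], data: Any) -> Any:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        data = step.run(data)
    return data


# Toy step bodies, invented for illustration only.
as_x = Step("ASx", "analysis", lambda values: [v for v in values if v is not None])
ts_1 = Step("TS1", "transformation", lambda values: [float(v) for v in values])
as_y = Step("ASy", "analysis", lambda values: sum(values) / len(values))

# AP0: an analytical pipeline is an ordered list of steps.
ap0 = [as_x, ts_1, as_y]
result = run_pipeline(ap0, ["1", None, "3"])
```

Keeping each step as a named, self-contained unit is what makes it possible to store steps in a library and reuse them in other pipelines.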
Scientific Workflows
[Figure: a scientific workflow (SW0) built from pipeline steps (ASx, TS1, ASy, ASz, ASr, TS2). Recoverable labels:]
• Search for relevant data (query)
• Iterative
Benefits
• Reusable analysis steps, pipelines, and workflows
• Formal documentation of methods (output in report format)
• Reproducibility of methods
• Visual creation and communication of methods
• Versioning
• Automated data typing and transformation
Ptolemy II demo
Ecological Niche Modeling
[Figure, modified from B. Michener: ecological niche modeling moves between geographic space and ecological space.]
• Biodiversity information (e.g., data from museum specimens)
• Geospatial and remotely sensed data (e.g., vegetation class, precipitation)
• Model of niche in ecological dimensions (axes: vegetation class, precipitation)
• Model type: linear regression (GRASP), genetic algorithms (GARP)
• Projection back onto geography: native range prediction, invaded range prediction
• Results used for integration with other data realms (e.g., human populations, public health, etc.)
Ecological Niche Models
[Figure: data from an Access file and an Excel file combined into integrated data.]
Samples:
Sample 1, lat, long, presence
Sample 2, lat, long, presence
Sample 3, lat, long, absence
Integrated data (presence/absence, vegetation cover type, elevation (m), mean annual temperature (C)):
P, juniper, 2200m, 16C
P, pinyon, 2320m, 14C
A, creosote, 1535m, 22C
GARP Native-Species Pipeline (informal)
[Figure: informal pipeline diagram. Recoverable labels:]
• Data sources: EcoGrid database (four instances), EcoGrid query (two instances)
• Inputs: species presence and absence points; environmental layers
• Steps: Layer Integration, Sample Data, Data Calculation, Map Generation, Validation, Physical Transformation, Scaling, Generate Metadata, Archive to EcoGrid
• Intermediate products: integrated layers, training sample, test sample, GARP rule set, model quality parameters
• Outputs: native range prediction map; selected prediction maps
• Annotations: +A1, +A2, +A3
• Actor: User
GARP Native-Species Pipeline (informal)
[Figure: the pipeline diagram from the previous slide, with the Sample Data step marked: "We will look at this analytic step."]
Sample Data: Basic Input/Output
[Figure: the Sample Data step with its parameters and annotations +A1, +A2, +A3.]
Input:
• Species presence points (dependent-variable coordinates)
• Environmental layers: temp., vegetation, etc. (independent-variable coordinates)
Output (presence under environmental conditions):
• Training sample of conditioned data
• Test sample of conditioned data
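A rough Python sketch of what a step with these inputs and outputs might do is given below. The slide does not say how the step samples or splits the data, so the lookup of each presence point in every layer, the shuffle, the `train_fraction` parameter, and all names here are assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]           # (x, y) coordinate of an observation
Layer = Dict[Point, float]            # one environmental value per coordinate


def sample_data(
    presence_points: Sequence[Point],
    layers: Dict[str, Layer],
    train_fraction: float = 0.5,      # assumed parameter, not from the slide
    seed: int = 0,
) -> Tuple[List[dict], List[dict]]:
    """Condition presence points on the layers, then split into two samples."""
    conditioned = []
    for point in presence_points:
        row = {"pres": 1}
        for name, layer in layers.items():
            row[name] = layer[point]  # environmental value at that point
        conditioned.append(row)
    rng = random.Random(seed)         # fixed seed keeps the split repeatable
    rng.shuffle(conditioned)
    cut = int(len(conditioned) * train_fraction)
    return conditioned[:cut], conditioned[cut:]


# Toy inputs, invented for illustration.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
layers = {
    "temp": {p: 10.0 + p[0] + p[1] for p in points},
    "veg": {p: 1.0 for p in points},
}
training, test = sample_data(points, layers)
```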
Analytic-Step Abstractions
• Physical Level: An analytic step is a particular software implementation that takes and produces physical data (for example, files)
• Logical Level: Defines the structure of input and output (like a database schema)
• Semantic Level: Uses ontological information to conceptually define the analytic step (for discovery and integration)
Sample Data: Physical Level
An actual program that implements Sample Data, with data as comma-delimited, plain text files.
Input:
33.454606, 106.789098; 33.454606, 106.789097; …
33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37; …
33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37; …
33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37; …
Output:
1, 56.25, 0, 20, …, 44; 0, 57.34, 0, 55, …, 14; …
0, 77.33, 1, 50, …, 44; 1, 56.01, 0, 55, …, 14; …
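At the physical level, a program has to know the file layout before it can do anything else. The sketch below reads records in the layout shown on this slide, assuming that commas separate fields and semicolons separate records; the function name `parse_records` is hypothetical, and the trailing "…" is treated as an elision marker rather than data.

```python
from typing import List


def parse_records(text: str) -> List[List[float]]:
    """Split semicolon-separated records into lists of numeric fields."""
    records = []
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk or chunk == "…":   # skip empty tails and the elision marker
            continue
        records.append([float(field) for field in chunk.split(",")])
    return records


points = parse_records("33.454606, 106.789098;33.454606, 106.789097; …")
layer = parse_records("33.454606, 106.789098, 56.25;33.454606, 106.789097, 56.37;…")
```

Nothing in the file itself says that the first two fields are coordinates or that the third is a layer value; that knowledge lives only in the program, which is the gap the logical and semantic levels are meant to close.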
Logical descriptions
Recall that a schema sets the allowable structure for data.
Employee
name : string   age : integer   ssn : string   title : string   salary : int
Clark           50              555-…          Mgr.             75000
Lewis           36              555-…          Sales            40000
These tables are not allowable instances of the logical description (too many columns; too few columns, wrong datatypes):
Smith   40   555-…   5
Jones   36   555-…   4
Davis   22   555-…   2
(A second table, with rows for Allen and Young, is not recoverable from the slide.)
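The idea of an allowable instance can be made concrete with a small check, sketched here in Python. The `conforms` function and the way the schema is encoded are assumptions for illustration; the attribute names, types, and rows come from the Employee example, with the elided "555-…" values kept as written.

```python
from typing import Sequence, Tuple

# The Employee schema from the slide, as (attribute name, Python type) pairs.
EMPLOYEE_SCHEMA = [
    ("name", str),
    ("age", int),
    ("ssn", str),
    ("title", str),
    ("salary", int),
]


def conforms(row: Sequence, schema: Sequence[Tuple[str, type]]) -> bool:
    """A row is allowable if it has the right column count and datatypes."""
    if len(row) != len(schema):
        return False
    return all(isinstance(value, expected) for value, (_, expected) in zip(row, schema))


clark = ("Clark", 50, "555-…", "Mgr.", 75000)
smith = ("Smith", 40, "555-…", 5)                      # too few columns
wrong_types = ("Lewis", "36", "555-…", "Sales", 40000)  # age given as text

clark_ok = conforms(clark, EMPLOYEE_SCHEMA)
smith_ok = conforms(smith, EMPLOYEE_SCHEMA)
wrong_types_ok = conforms(wrong_types, EMPLOYEE_SCHEMA)
```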
Sample Data: Logical Level
Input:
• matrix[x, y]: 2-dimensional matrix
• list(matrix[x, y, z]): list of 3-dimensional matrices, one matrix per environmental layer
Output:
• sample1(pres, temp, veg, …, zn)
• sample2(pres, temp, veg, …, zn)
Each output is a relation of n+1 attributes for n environmental layers.
Why have the Logical Level?
• Data independence: Hides the details of how information is represented (text or binary files) from what is represented (a table of integers)
• Reduced application development time: Makes information more easily reusable, for example, by other applications or services, with programs for handling the physical/logical level
• Can help enable integration: Explicit knowledge of the structure and types of data can help automate conversion, for example, by using higher-level languages
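Data independence can be illustrated with two readers that hide different physical encodings behind the same logical result, a table of integers. This is a minimal sketch: the function names and the binary layout (little-endian 32-bit integers, row by row) are assumptions chosen for the example.

```python
import struct
from typing import List

Table = List[List[int]]


def table_from_text(text: str) -> Table:
    """Physical form 1: comma-delimited plain text, one row per line."""
    return [[int(field) for field in line.split(",")] for line in text.strip().splitlines()]


def table_from_binary(blob: bytes, n_cols: int) -> Table:
    """Physical form 2: little-endian 32-bit integers stored row by row."""
    count = len(blob) // 4
    values = struct.unpack(f"<{count}i", blob)
    return [list(values[i:i + n_cols]) for i in range(0, count, n_cols)]


text_file = "1,2,3\n4,5,6\n"
binary_file = struct.pack("<6i", 1, 2, 3, 4, 5, 6)

from_text = table_from_text(text_file)
from_binary = table_from_binary(binary_file, n_cols=3)
```

Code written against the logical table works unchanged with either file, which is where the reuse and reduced development time come from.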
Choosing a logical representation
Input:
• matrix[x, y]: 2-dimensional matrix
• list(matrix[x, y, z]): list of 3-dimensional matrices, one matrix per environmental layer
Output:
• sample1(pres, temp, veg, …, zn)
• sample2(pres, temp, veg, …, zn)
Each output is a relation of n+1 attributes for n environmental layers.
Can you see any potential problems with this choice of logical output?
Choosing a logical representation
Input:
• matrix[x, y]
• list(matrix[x, y, z])
Output:
• sample1(pres, z1, z2, …, zn)
• sample2(pres, z1, z2, …, zn)
Service: avail(pres, temp, veg, elev) (its connection to the Sample Data output is marked "?")
The output structure is dependent on the input data…
Choosing a logical representation
Input:
• matrix[x, y]
• list(matrix[x, y, z])
Output:
• sample1(obs, property, value)
• sample2(obs, property, value)
Service: avail(obs, property, value)
Reusability is easier when the logical representation is known ahead of time…
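The move from a relation whose width depends on the input to a fixed (obs, property, value) form can be sketched as below. The `to_long` function is hypothetical, and pairing the sample values 1, 56.25, 0 with the attribute names pres, temp, veg is an assumption made so the example has something to show.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[int, str, float]


def to_long(rows: Sequence[Sequence[float]], columns: Sequence[str]) -> List[Triple]:
    """Rewrite wide rows as (obs, property, value) triples."""
    triples = []
    for obs, row in enumerate(rows, start=1):   # obs numbers the observations
        for prop, value in zip(columns, row):
            triples.append((obs, prop, value))
    return triples


# Two wide rows; with n layers each row would have n+1 fields.
wide = [(1, 56.25, 0), (0, 57.34, 0)]
long_form = to_long(wide, ["pres", "temp", "veg"])
```

Adding a layer now adds rows rather than columns, so a downstream service can be written against the three-attribute schema without knowing in advance how many layers were used.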
Sample Data: Semantic input/output
[Figure: ontology fragment. Recoverable labels:]
• Model concepts: EcologicalModel, BiodiversityModel, EcoNicheModel, RegressionBased ENM, StatisticalModel, RegressionModel, LogisticRegression
• Variable concepts: StatisticalVariable, DependentVariable, IndependentVariable, StatisticalContext
• Relationships: usesRegressionModel, hasDepVar, hasIndVar, hasContext
Putting it all together
Physical = Data
Logical + Semantic = Metadata
Input:
• matrix[x, y]
  Data: 33.454606, 106.789098; 33.454606, 106.789097; …
  Semantics: DependentVariable, hasContext, GridCoordinate, StatisticalContext
• list(matrix[x, y, z])
  Data: 33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37; …
        33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37; …
        33.454606, 106.789098, 56.25; 33.454606, 106.789097, 56.37; …
  Semantics: IndependentVariable, hasContext, GridCoordinate, StatisticalContext
Output:
• sample1(obs, property, value)
  Data: 1, 56.25, 0, 20, …, 44; 0, 57.34, 0, 55, …, 14; …
  Semantics: StatisticalDataset, hasDepVar DependentVariable, hasIndVar IndependentVariable
• sample2(obs, property, value)
  Data: 1, 56.25, 0, 20, …, 44; 0, 57.34, 0, 55, …, 14; …
  Semantics: StatisticalDataset, hasDepVar DependentVariable, hasIndVar IndependentVariable
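One way to picture the three levels side by side is a small record per input or output, with a check that uses the logical and semantic metadata to decide whether two steps can be wired together. Everything in this sketch is hypothetical (the `PortDescription` class, the `can_connect` rule, and the file names); only the logical signatures and ontology terms are taken from the slides.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PortDescription:
    physical: str              # where the data lives (hypothetical file name)
    logical: str               # structure, written as on the slides
    semantic: FrozenSet[str]   # ontology concepts attached to the port


def can_connect(output: PortDescription, input_: PortDescription) -> bool:
    """Assumed rule: same logical structure and at least one shared concept."""
    return output.logical == input_.logical and bool(output.semantic & input_.semantic)


sample1 = PortDescription(
    physical="sample1.txt",
    logical="(obs, property, value)",
    semantic=frozenset({"StatisticalDataset", "DependentVariable", "IndependentVariable"}),
)
service_input = PortDescription(
    physical="avail.txt",
    logical="(obs, property, value)",
    semantic=frozenset({"StatisticalDataset"}),
)
wide_input = PortDescription(
    physical="wide.txt",
    logical="(pres, temp, veg, elev)",
    semantic=frozenset({"StatisticalDataset"}),
)

fits_service = can_connect(sample1, service_input)
fits_wide = can_connect(sample1, wide_input)
```

A real mediation system would reason over the ontology rather than test for set overlap; the point here is only that the metadata, not the file contents, drives the decision.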
Domain Workflow
[Figure: the GARP native-species pipeline diagram, presented as the domain workflow. Labels specific to this domain:]
• Input: species presence and absence points; environmental layers
• Model: GARP rule set
• Output: native range prediction map
Generic Workflow
[Figure: the same workflow diagram with generalized labels:]
• Input: occurrence data (binary, categorical, or numeric); environmental layers
• Model: GARP (or other) rule set
• Output: prediction map
Temperature Interpolation Workflow
[Figure: the same workflow diagram relabeled for a different domain:]
• Input: weather station temperature data
• Environmental layers: elevation, aspect, land cover
• Model: GARP rule set (label unchanged on the slide)
• Output: prediction map (interpolated temperature grid)
Extending Workflows: Climate
[Figure: two runs of the pipeline (ASx, TS1, ASy, ASz, ASr, TS2), both using the prediction model from the native area.]
• Current environmental layers: prediction maps under current conditions
• Changed environmental layers: prediction maps under changed conditions
• Compare to get predicted effect of environmental change on species
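The slide says only that the two sets of prediction maps are compared. One simple form that comparison could take, assuming each map is a grid of 0/1 presence predictions on the same cells, is a cell-by-cell difference; the function name and the toy grids below are invented for illustration.

```python
from typing import List

Grid = List[List[int]]


def compare_maps(current: Grid, changed: Grid) -> Grid:
    """Cell-by-cell difference: -1 predicted loss, 0 no change, +1 predicted gain."""
    return [
        [after - before for before, after in zip(row_now, row_later)]
        for row_now, row_later in zip(current, changed)
    ]


# Toy 2x2 prediction maps (1 = species predicted present).
current_map = [[1, 0], [1, 1]]
changed_map = [[0, 0], [1, 1]]
effect = compare_maps(current_map, changed_map)
```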
Extending Workflows: Invasion
[Figure: two runs of the pipeline (ASx, TS1, ASy, ASz, ASr, TS2), both using the prediction model from the native area.]
• Native area occurrence and environmental layers: prediction maps in native area
• Invasion area environmental layers: prediction maps in invasion area
Process
1. Create the domain workflow at a conceptual level
2. Define the physical and logical data types for each step
3. Define the ontological data types for each step, for both the domain and a generic ontology
4. Map the domain workflow to a generic workflow
5. Map the generic workflow to other domain workflows
Exercise
Divide into two groups (roughly half in each):
1. Climate change
2. Invasive species
Download generic workflow from: ftp://ftp.lternet.edu/pub/outgoing/penningd
Work on conceptual workflows that:
1. Reuse the generic pipeline
2. Extend the generic pipeline
3. Create new pipelines
Use PowerPoint, Visio, or paper tablets… your choice!