E-Science: Stuart Anderson National e-Science Centre Stuart Anderson National e-Science Centre

e-Science:e-Science:

Stuart Anderson National e-Science Centre

Stuart Anderson National e-Science Centre

Cool White DwarvesCool White Dwarves

Issues 1Issues 1

• Astronomers are looking for:– Many objects in globular clusters– Very faint objects– Interested in observations of many locations

• But:– The observations are noisy:

• Artifacts created by the sensor technology, scanning and digitizing.

• Junk in orbit, e.g. satellite tracks.

• Computer Science can help:- Pattern recognition, computational learning,

data mining.

- But: Astronomers are more picky.

• Astronomers are looking for:– Many objects in globular clusters– Very faint objects– Interested in observations of many locations

• But:– The observations are noisy:

• Artifacts created by the sensor technology, scanning and digitizing.

• Junk in orbit, e.g. satellite tracks.

• Computer Science can help:- Pattern recognition, computational learning,

data mining.

- But: Astronomers are more picky.

Cool Dwarves are faint and closeCool Dwarves are faint and close

•The sky is full of faint objects.

• Cool White Dwarves are close.

• So they move about relative to the background stars.

• The illustrated observations cover a period of 30 years.

• We need to match up very faint objects observed by different equipment at different times.

Issues 2Issues 2

• Astronomers have a model of how luminous CWDs are that predicts how distant they are and hence how they move over time.

• We can use computational learning (aka data mining) to recognize CWDs provided we have a model that allows tractable learning.

• We can use the model to create training cases for various learning techniques.

• Astronomers also want to observe the same objects at different wavelengths.

• Models of objects can be used as a basis for data mining to link observations.

• Astronomers have a model of how luminous CWDs are that predicts how distant they are and hence how they move over time.

• We can use computational learning (aka data mining) to recognize CWDs provided we have a model that allows tractable learning.

• We can use the model to create training cases for various learning techniques.

• Astronomers also want to observe the same objects at different wavelengths.

• Models of objects can be used as a basis for data mining to link observations.

Problem ScaleProblem Scale

• Cosmos (old technology), megabytes per plate.

• Super Cosmos (current technology), gigabytes per plate.

• Cosmos and Super Cosmos use 1m telescope images

• Vista (new technology): imaging in visible and x-ray using digital detectors, 4m telescope, terabytes per night.

• Sky surveys look at large-scale structure of space so many images are involved e.g. to estimate the density of CWDs in the galaxy.

• Cosmos (old technology), megabytes per plate.

• Super Cosmos (current technology), gigabytes per plate.

• Cosmos and Super Cosmos use 1m telescope images

• Vista (new technology): imaging in visible and x-ray using digital detectors, 4m telescope, terabytes per night.

• Sky surveys look at large-scale structure of space so many images are involved e.g. to estimate the density of CWDs in the galaxy.

E-Science and Old ScienceE-Science and Old Science

• Computational models have been used for many years.

• e-Science systems will include vast collections of observed data.

• Scientific models are the essential organizing principle for data in such systems.

• Currently we are hand-crafting models that organise subsets of the data (e.g. CWDs).

• Can we create experimental environments that allow scientists to create new models of phenomena and test them against data?

• Computational models have been used for many years.

• e-Science systems will include vast collections of observed data.

• Scientific models are the essential organizing principle for data in such systems.

• Currently we are hand-crafting models that organise subsets of the data (e.g. CWDs).

• Can we create experimental environments that allow scientists to create new models of phenomena and test them against data?

Data, Information and KnowledgeData, Information and Knowledge• Much Grid work identifies a three-layer

architecture for data.• Data is the raw data acquired from

sensors (e.g. telescopes, microscopes, particle detectors).

• Information is created when we “clean up” data to eliminate artifacts of the collection process.

• Knowledge is information embedded within an interpretive framework.

• Science provides strong interpretive frameworks

• Much Grid work identifies a three-layer architecture for data.

• Data is the raw data acquired from sensors (e.g. telescopes, microscopes, particle detectors).

• Information is created when we “clean up” data to eliminate artifacts of the collection process.

• Knowledge is information embedded within an interpretive framework.

• Science provides strong interpretive frameworks

Pattern: More science “in silico”Pattern: More science “in silico”• Improved sensors, more sensors, huge

increase in data volume.• Need to “clean”, “mine” structure data.• Support complex models and large-scale

data collections inside the computer(s)• Support for flexible model development

and using models to organise and access data.

• E.g. in databases, spatial organisation, temporal organisation and support for queries exploiting that structure – useful for Geoscience?

• Improved sensors, more sensors, huge increase in data volume.

• Need to “clean”, “mine” structure data.• Support complex models and large-scale

data collections inside the computer(s)• Support for flexible model development

and using models to organise and access data.

• E.g. in databases, spatial organisation, temporal organisation and support for queries exploiting that structure – useful for Geoscience?

CreditsCredits

• Cosmos, Super Cosmos and Vista are projects looking at large scale structure of the cosmos, based at the Royal Observatory Edinburgh.

• Chris Williams, Bob Mann and Andy Lawrence are working on using computational learning to analyse super Cosmos data at RoE.

• Andy Lawrence is director of the AstroGrid project that is a major UK contribution to the international “Virtual Observatory” that will federate the worlds major astronomical data assets.

• Cosmos, Super Cosmos and Vista are projects looking at large scale structure of the cosmos, based at the Royal Observatory Edinburgh.

• Chris Williams, Bob Mann and Andy Lawrence are working on using computational learning to analyse super Cosmos data at RoE.

• Andy Lawrence is director of the AstroGrid project that is a major UK contribution to the international “Virtual Observatory” that will federate the worlds major astronomical data assets.

Whither Data Management?Whither Data Management?

• Scientific data is not particularly well behaved.

• In particular, it does not fit the relational model particularly well.

• We need new data models that are better suited to the needs of science (and everyone else too!).

• The model should attempt to support the work of scientists effectively.

• Current data models are not particularly useful.

• Scientific data is not particularly well behaved.

• In particular, it does not fit the relational model particularly well.

• We need new data models that are better suited to the needs of science (and everyone else too!).

• The model should attempt to support the work of scientists effectively.

• Current data models are not particularly useful.

Curated DatabasesCurated Databases

• Useful scientific databases are often curated : they are created/ maintained with a great deal of “manual” labour.

• Useful scientific databases are often curated : they are created/ maintained with a great deal of “manual” labour.

select xyzfrom pqrwhere abc

Database people’s idea of what happens

What really happens

DB1 DB2

Inter-dependence is ComplexInter-dependence is Complex

GERD

TRRD

GenBank

Swissprot

EpoDB

TransFac

GAIA

BEAD

A few of the 500 or so public curated molecular biology databases

Issues in Curated DatabasesIssues in Curated Databases

• Data integration (always a problem). Need to deal with schema evolution

• Data provenance. How do you track data back to its source (this information is typically lost)

• Data annotation. How should annotations spread through this network?

• Archiving. How do you keep all the archives when you are “publishing” a new database every day?

• Data integration (always a problem). Need to deal with schema evolution

• Data provenance. How do you track data back to its source (this information is typically lost)

• Data annotation. How should annotations spread through this network?

• Archiving. How do you keep all the archives when you are “publishing” a new database every day?

ArchivingArchiving

• Some recent results on efficient archiving (Buneman, Khanna, Tajima, Tan)

• OMIM (On-line Mendelian Inheritance in Man) is a widely used genetic database. A new version is released daily.

• Bottom line, we can archive a year of versions of OMIM with <15% more space than the most recent version

• Some recent results on efficient archiving (Buneman, Khanna, Tajima, Tan)

• OMIM (On-line Mendelian Inheritance in Man) is a widely used genetic database. A new version is released daily.

• Bottom line, we can archive a year of versions of OMIM with <15% more space than the most recent version

A Sequence of VersionsA Sequence of Versions

“Pushing” time down“Pushing” time down

[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.” ]

The final result(for the randomlyselected data)

Predicted expansion for a year’s archive: < 15%

Summary: technical issuesSummary: technical issues

• Why and where:– better characterization of where (new ideas

needed)– negation/aggregation

• Keys:– inference rules for relative keys– foreign key constraints– interaction between keys and DTDs/types

• Types for deterministic model (and other models).

• Annotation• Temporal QLs and archives

• Why and where:– better characterization of where (new ideas

needed)– negation/aggregation

• Keys:– inference rules for relative keys– foreign key constraints– interaction between keys and DTDs/types

• Types for deterministic model (and other models).

• Annotation• Temporal QLs and archives

Pattern: Better support for workPattern: Better support for work• Data is increasingly complex and

interdependent.• “Curating” the data is continuous, and

involves international effort to increase the scientific value of the data.

• Understanding the way we work with data is the key to providing adequate support for that work.

• Deeper support for projects working across the globe.

• Data is increasingly complex and interdependent.

• “Curating” the data is continuous, and involves international effort to increase the scientific value of the data.

• Understanding the way we work with data is the key to providing adequate support for that work.

• Deeper support for projects working across the globe.

CreditsCredits

• These issues are being addressed by Peter Buneman at Edinburgh.

• Peter has recently joined Informatics and NeSC.

• He has worked for a number of years on Digital Libraries and Biological Data Management.

• These issues are being addressed by Peter Buneman at Edinburgh.

• Peter has recently joined Informatics and NeSC.

• He has worked for a number of years on Digital Libraries and Biological Data Management.

Documents

E-Science: Stuart Anderson National e-Science Centre Stuart Anderson National e-Science Centre