UKOLN is supported by:
From research data to new knowledge: a lifecycle approach.
Dr Liz Lyon, Director
UKOLN, University of Bath, UK
JISC/SURF/CNI Conference May 2005, Amsterdam.
www.bath.ac.uk
a centre of expertise in digital information management
www.ukoln.ac.uk
Overview
1. Scholarly communications in flux
2. e-Research and the diversity of data
3. Repositories & meta-functionality
• Realising the link to learning: eBank UK
• Providing value-added services
• Enabling knowledge extraction & post-processing
4. Look at (some of) the issues en route
1. Scholarly communications in flux
A medieval scriptorium…..
Research & e-Science workflows
Aggregator services: national, commercial
Repositories: institutional, e-prints, subject, data, learning objects
Harvesting metadata
Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
Deposit / self-archiving
Peer-reviewed publications: journals, conference proceedings
Publication
Validation
Data analysis, transformation, mining, modelling
Searching, harvesting, embedding
Presentation services: subject, media-specific, data, commercial portals
Resource discovery, linking, embedding
The scholarly knowledge cycle. Liz Lyon, Ariadne, July 2003.
Learning & Teaching workflows
Aggregator services: national, commercial
Repositories: institutional, e-prints, subject, data, learning objects
Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
Harvesting metadata
Resource discovery, linking, embedding
Peer-reviewed publications: journals, conference proceedings
Validation
Resource discovery, linking, embedding
Deposit / self-archiving
Learning object creation, re-use
Searching, harvesting, embedding
Quality assurance bodies
Validation
Presentation services: subject, media-specific, data, commercial portals
Learning & Teaching workflows
Research & e-Science workflows
Aggregator services: national, commercial
Repositories: institutional, e-prints, subject, data, learning objects
Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
Harvesting metadata
Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
Resource discovery, linking, embedding
Deposit / self-archiving
Peer-reviewed publications: journals, conference proceedings
Publication
Validation
Data analysis, transformation, mining, modelling
Resource discovery, linking, embedding
Deposit / self-archiving
Learning object creation, re-use
Searching, harvesting, embedding
Quality assurance bodies
Validation
Presentation services: subject, media-specific, data, commercial portals
Resource discovery, linking, embedding
2. e-Research and the diversity of data
Assuring permanent open access to the records of science & the humanities?
Long term access to primary data
• Increasing data volumes from eScience and Grid-enabled / cyberinfrastructure applications
• Changing research paradigm: data-driven science, “big science”
• Observational data, simulations, large-scale experimentation, computations
• Multi-media resources, statistical data, surveys, geo-spatial data……
Diversity of data collections
• Very large, relatively homogeneous: Large Hadron Collider (LHC) outputs from CERN
• Smaller, heterogeneous and richer collections: World Data Centre for Solar-terrestrial Physics, CCLRC
• Small-scale laboratory results: “jumping robots” project at the University of Bath
• Population survey data: UK Biobank
• Highly sensitive, personal data: patient care records
Taxonomy of data collections
• Research collections: jumping robots
• Community collections: Flybase at Indiana (with UC Berkeley)
• Reference collections: Protein Data Bank
Source: NSF Long-Lived Digital Data Collections, Draft report March 2005
Evolution……
Repository evolution:
1971 Research collection
<12 files
2005 Reference collection
>2700 structures deposited in 6 months
1. Issues: research data as content
• Sharing it!
• Data diversity
– Homo- or heterogeneous
– Raw and derived / processed
– Sensitivity
– Fast or slow growth in volume
• Repository evolution:
– Likelihood to scale up (from bytes to petabytes)
– Quality assurance (from the start)
– Community-based standards development (“folksonomies”)
– Build robust services
3. Repositories & meta-functionality
eBank UK: linking research data to learning
• JISC-funded September 2003, Phase 2 February 2005
• UKOLN at the University of Bath (lead), University of Southampton, University of Manchester
• Exemplar: e-Science testbed ‘Combechem’
– Grid-enabled combinatorial chemistry
– Crystallography, laser and surface chemistry examples
– Development of an e-Lab using pervasive computing technology
– National Crystallography Service
• Resource Discovery Network / PSIgate physical sciences portal
• http://www.ukoln.ac.uk/projects/ebank-uk/
Learning & Teaching workflows
Research & e-Science workflows
Aggregator services:
eBank UK
Repositories: institutional, e-prints, subject, data, learning objects
Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
Harvesting metadata
Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
Resource discovery, linking, embedding
Deposit / self-archiving
Peer-reviewed publications: journals, conference proceedings
Publication
Validation
Data analysis, transformation, mining, modelling
Resource discovery, linking, embedding
Deposit / self-archiving
Learning object creation, re-use
Searching, harvesting, embedding
Quality assurance bodies
Validation
Presentation services: subject, media-specific, data, commercial portals
Resource discovery, linking, embedding
Data Flow in eBank UK
[Diagram. Labels: Create; Submit; Store/link; Data files; Metadata; Institutional repository; present (HTML); OAI-PMH; Harvest (XML); eBank aggregator; Index and Search; present (HTML)]
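The "Harvest (XML)" step between the institutional repository and the eBank aggregator is labelled OAI-PMH. As a rough sketch of what an aggregator might do with a harvested response, the Python below builds a ListRecords request URL and parses a response carrying unqualified Dublin Core. The endpoint URL, the sample record and the function names are illustrative assumptions; they are not eBank UK's actual service, schema or code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract identifier, title, creators and subjects from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:  # for example, a deleted record with no metadata
            continue
        records.append({
            "identifier": identifier,
            "title": dc.findtext("dc:title", default="", namespaces=NS),
            "creators": [e.text for e in dc.findall("dc:creator", NS)],
            "subjects": [e.text for e in dc.findall("dc:subject", NS)],
        })
    return records


# Made-up response for illustration only (hypothetical repository and record).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Crystal structure dataset (made-up example)</dc:title>
          <dc:creator>Example, A.</dc:creator>
          <dc:subject>crystallography</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    print(list_records_url("http://repository.example.org/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

A real harvester would also fetch the URL over HTTP and follow resumption tokens for large result sets; both are left out here to keep the sketch self-contained.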
Comb-e-Chem Project
[Diagram. Labels: X-Ray e-Lab; Diffractometer; Analysis; Properties; Properties e-Lab; Simulation; Video; Grid Middleware; Structures Database]
The digital repository
ecrystals.chem.soton.ac.uk
Acknowledgement: Simon Coles
Access to the underlying data
Harvesting: OAIster
Aggregating: search & discover
Linking to publications
eBank embedded in a science portal
eBank Phase 2: linking to learning
• Embedding in e-Learning processes
• Evaluating the pedagogical benefits
– MChem course
– Chemical informatics course
2. Issues: generic data models, metadata schema & terminology
• Validation against other schema
– CCLRC Scientific Data Model Version 2
• Complex digital objects and packaging options
– METS
– MPEG 21 DIDL
• Terminologies
– Domain: crystallography
– Inter-disciplinary e.g. biomaterials
– Metadata enhancement: subject keyword additions to datasets based on knowledge of keywords in related publications
– Meaningful resource discovery?
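The metadata enhancement idea above (adding subject keywords to a dataset from the keywords of its related publications) can be illustrated with a minimal Python sketch. The record layout, the field names `subjects`, `related_publications` and `keywords`, and the function name are assumptions made for this example; the slides do not specify how eBank UK implements the enhancement.

```python
def enhance_keywords(dataset, publications):
    """Return a copy of a dataset record with keywords added from linked publications.

    Hypothetical record layout: a dataset lists the ids of related publications,
    and each publication carries its own keyword list.
    """
    subjects = list(dataset.get("subjects", []))
    seen = {s.lower() for s in subjects}  # case-insensitive duplicate check
    linked = set(dataset.get("related_publications", []))
    for pub in publications:
        if pub["id"] not in linked:
            continue  # only borrow keywords from publications linked to this dataset
        for keyword in pub.get("keywords", []):
            if keyword.lower() not in seen:
                seen.add(keyword.lower())
                subjects.append(keyword)
    return {**dataset, "subjects": subjects}


# Made-up records for illustration only.
DATASET = {
    "id": "dataset-1",
    "subjects": ["crystallography"],
    "related_publications": ["pub-1"],
}
PUBLICATIONS = [
    {"id": "pub-1", "keywords": ["Crystallography", "biomaterials"]},
    {"id": "pub-2", "keywords": ["unrelated topic"]},
]

if __name__ == "__main__":
    print(enhance_keywords(DATASET, PUBLICATIONS)["subjects"])
```

Whether keywords borrowed this way lead to meaningful resource discovery is the open question the slide raises; the sketch only shows the mechanical step.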
3. Issues: linking and identifiers
• Links to individual datasets within an experiment
• Links to all datasets associated with an experiment or a data collection
• Links to derived eprints and published literature
• Context sensitive linking: find me
– Datasets by this author / creator
– Datasets related to this subject
– Learning objects by this author / creator
– Learning objects related to this subject
• Identifiers and persistence
– “generic”
– domain: International Chemical Identifier (InChI code)
• Resource discovery: Google Scholar?
• Provenance: authenticity, authority, integrity?
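The "find me" queries listed under context sensitive linking can be sketched as lookups in a small in-memory index keyed by resource type, field and value. Everything here is a hypothetical illustration: the record fields, the type names `dataset` and `learning_object`, and both function names are assumptions, not part of eBank UK.

```python
from collections import defaultdict


def build_link_index(records):
    """Index record ids by (resource type, field, value)."""
    index = defaultdict(list)
    for rec in records:
        for creator in rec.get("creators", []):
            index[(rec["type"], "creator", creator)].append(rec["id"])
        for subject in rec.get("subjects", []):
            index[(rec["type"], "subject", subject)].append(rec["id"])
    return index


def find_related(index, record, target_type):
    """Find items of target_type that share a creator or subject with record."""
    hits = []
    for field, key in (("creator", "creators"), ("subject", "subjects")):
        for value in record.get(key, []):
            for rid in index.get((target_type, field, value), []):
                if rid != record["id"] and rid not in hits:
                    hits.append(rid)
    return hits


# Made-up records for illustration only.
RECORDS = [
    {"id": "dataset-1", "type": "dataset",
     "creators": ["Example, A."], "subjects": ["crystallography"]},
    {"id": "dataset-2", "type": "dataset",
     "creators": ["Other, B."], "subjects": ["crystallography"]},
    {"id": "lo-1", "type": "learning_object",
     "creators": ["Example, A."], "subjects": ["chemical informatics"]},
]
INDEX = build_link_index(RECORDS)

if __name__ == "__main__":
    # Datasets related to the subject of dataset-1.
    print(find_related(INDEX, RECORDS[0], "dataset"))
    # Learning objects by the creator of dataset-1.
    print(find_related(INDEX, RECORDS[0], "learning_object"))
```

Matching on plain name strings is fragile, which is one reason the slide pairs linking with persistent identifiers; a production service would key the index on identifiers rather than display names.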
4. Issues: embedding and workflow
• Into the crystallographic publishing community: International Union of Crystallography
• Into the chemistry research workflow
– SMART TEA Digital Lab Book e-synthesis Lab
– Other analytical techniques and instrumentation
• Into the curriculum and e-Learning workflows
– MChem course
– Undergraduate Chemical Informatics courses
Repositories and digital curation
Data preservation: static; for later use?
Data curation: dynamic; in use now (and the future)?
“maintaining and adding value to a trusted body of digital information for current and future use”
Provide value-added services
Annotation
• e-Lab books (Smart Tea Project in chemistry)
• Gene and protein sequences
Enable “post-processing” and knowledge extraction
The acquisition of newly-derived information and knowledge from repository content
• Run complex algorithms over primary datasets
• Mining (data, text, structures)
• Modelling (economic, climate, mathematical, biological)
• Analysis (statistical, lexical, pattern matching, gene)
• Presentation (visualisation, rendering)
5. Issues: “knowledge services”
• Layered over repositories
– Annotation
– Mining, modelling, analysis
– Visualisation
• Across multiple repositories
– Grid enabled applications
– Highly distributed, dynamic and collaborative
• Associated with curatorial responsibility
– UK Digital Curation Centre http://www.dcc.ac.uk
Issues summary
1. Research data is diverse, increasing rapidly in volume and complexity
2. Repository collections are dynamic and evolve
3. Technical challenges associated with interoperability, persistence, provenance, resource discovery and infrastructure provision
4. Embedding in workflow is critical: scholarly communications, research practice, learning
5. Knowledge extraction tools will generate new discoveries based on repository content
6. Repository solutions must scale: M2M processing will become the norm……