eScience Research Round Table (ERRT) @ GSLIS on Data Curation for Biodiversity Informatics
KURATOR: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data
Bertram Ludäscher
Graduate School of Library and Information Science (GSLIS)
National Center for Supercomputing Applications (NCSA)
ERRT @ GSLIS 10/22/2014 2
Outline
• Kurator:
  – What problems is Kurator tackling and for whom?
  – Curation Workflow Example
  – How we’re going about it
• Not Today:
  – Related Biodiversity Informatics Projects
    • Filtered-Push
    • Exploring Taxon Concepts (ETC)
    • Euler
  – Other Informatics Projects
    • DataONE
    • SKOPE
What is Kurator?
• NSF-DBI #1356751 – Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data
  – Sept. 2014 – 2017
  – @Illinois:
    • B. Ludäscher, James Macklin, Tim McPhillips, …
  – @Harvard:
    • James Hanken, Paul Morris, Bob Morris, …
Problem: Data & Metadata Quality
• Collections & occurrence data is all over the map
  – … literally (off the map!)
• Issues:
  – Lat/Long transposition, coordinate & projection issues
  – Scientific Names (spelling errors, other)
  – Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)
• Precursor:
  – Filtered-Push Collaboration
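To make the lat/long transposition issue concrete, here is a minimal sketch of one possible check. It is not Kurator code: the function names and status strings are hypothetical, and the heuristic only catches swaps where the latitude value falls outside ±90 degrees. Swaps that leave both values in range would need extra context, such as the stated country.

```python
from typing import Optional, Tuple


def in_range(lat: float, lon: float) -> bool:
    """True if the pair is a syntactically valid WGS84-style coordinate."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def check_coordinates(
    lat: float, lon: float
) -> Tuple[str, Optional[Tuple[float, float]]]:
    """Classify a coordinate pair (hypothetical status labels).

    Returns (status, suggestion), where suggestion is a (lat, lon) pair
    proposed as a repair, or None if no repair is proposed.
    """
    if in_range(lat, lon):
        return "ok", None
    # Out of range as given, but valid once swapped: likely a transposition.
    if in_range(lon, lat):
        return "possibly_transposed", (lon, lat)
    return "invalid", None
```

A record entered as latitude -120.5, longitude 38.5 would be flagged as possibly transposed with (38.5, -120.5) as the suggested repair, while a pair such as (200, 300) is reported as invalid with no suggestion.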
What Problems does Kurator try to solve?
• Detect and flag data quality issues
• Repair if possible
• Keep track of provenance
  – automatic repairs
  – human curator edits
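The three goals above (flag, repair, keep provenance) can be sketched in a few lines. This is an illustration under stated assumptions, not the project's data model: `CuratedRecord`, `ProvenanceEntry`, `curate_name`, the agent labels, and the tiny `KNOWN_FIXES` lookup are all hypothetical. A real service would consult a taxonomic name authority instead of a hard-coded table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List

# Hypothetical lookup of known misspellings, for illustration only.
KNOWN_FIXES = {"Quercus albus": "Quercus alba"}


@dataclass
class ProvenanceEntry:
    field_name: str
    old_value: str
    new_value: str
    agent: str       # e.g. "auto:name-check" or "human:curator"
    timestamp: str   # ISO 8601, UTC


@dataclass
class CuratedRecord:
    data: Dict[str, str]
    flags: List[str] = field(default_factory=list)
    provenance: List[ProvenanceEntry] = field(default_factory=list)

    def update(self, field_name: str, new_value: str, agent: str) -> None:
        """Change a value and remember who changed it, and from what."""
        old_value = self.data.get(field_name, "")
        self.data[field_name] = new_value
        stamp = datetime.now(timezone.utc).isoformat()
        self.provenance.append(
            ProvenanceEntry(field_name, old_value, new_value, agent, stamp)
        )


def curate_name(record: CuratedRecord) -> CuratedRecord:
    """Detect and flag a known misspelling, then repair it if possible."""
    name = record.data.get("scientificName", "")
    if name in KNOWN_FIXES:
        record.flags.append("scientificName: known misspelling")
        record.update("scientificName", KNOWN_FIXES[name], agent="auto:name-check")
    return record
```

Automatic repairs and human curator edits go through the same `update` method, so both kinds of change end up in one provenance trail and differ only in the `agent` label.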
Who are the customers?
• Collection Managers
  – … who are managing the collections databases
  – Can run curation workflows periodically
    • … in the presence of new data and/or new curation services
• (Biodiversity) Researchers
  – To perform an analysis in the presence of (partially) dirty data, researchers need to
    • Clean or fix dirty data
    • Throw out unfixable data
  – Pushing changes to the original data collections and collection managers (cf. FPush)
Example: Kepler/Kurator (FPush project)
Simplified Example Workflow
• Related Research (Tianhong Song, UC Davis)
  – Analyze linear workflow “story”
  – Use patterns to discover workflow design issues (e.g. use before update); then fix them
  – Parallelize when possible
• Kurator:
  – Allow easy assembly of such workflows
  – For tool makers
  – … and tool users
  – … scalability challenge.
Example Output …
… close up …
How we do it
• Build a library of curation services such that curation workflows can be run from various platforms
  – Scientific workflow systems
    • e.g. Restflow, Kepler, Taverna, Galaxy
  – Other platforms
    • e.g. Akka, Python-based, …
• … leveraging existing technologies
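One way to read "a library of curation services" that can run from various platforms is to keep each service a plain function with a uniform record-in, record-out signature, so that any workflow system only needs a thin adapter around it. The sketch below illustrates that idea in Python. The registry, the decorator, and both example services are hypothetical and are not taken from the Kurator code base.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
Service = Callable[[Record], Record]

# Hypothetical registry mapping a service name to its implementation.
SERVICES: Dict[str, Service] = {}


def service(name: str) -> Callable[[Service], Service]:
    """Decorator that registers a curation service under a name."""
    def register(fn: Service) -> Service:
        SERVICES[name] = fn
        return fn
    return register


@service("trim_whitespace")
def trim_whitespace(record: Record) -> Record:
    """Strip leading and trailing whitespace from every value."""
    return {key: value.strip() for key, value in record.items()}


@service("normalize_country")
def normalize_country(record: Record) -> Record:
    """Map a few spellings of one country to a single form (toy example)."""
    out = dict(record)
    if out.get("country", "").lower() in {"usa", "u.s.a.", "united states"}:
        out["country"] = "United States"
    return out


def run_pipeline(names: List[str], records: Iterable[Record]) -> Iterator[Record]:
    """Minimal stand-in for a workflow engine: apply services in order."""
    for record in records:
        for name in names:
            record = SERVICES[name](record)
        yield record
```

Here `run_pipeline` plays the role of the platform. A scientific workflow system or an actor framework would replace it with its own scheduling, while the service functions themselves stay unchanged.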
How we do it
• Open source, community-friendly approach
  – git repository (NCSA open source projects)
• Agile software development
  – NCSA support tools, e.g. JIRA, Bamboo
• Inspired by
  – Small bioinformatics tools manifesto (post-facto)
  – Unix tenets (small, interoperable tools, … )
  – Experience with other (sometimes not so agile) development projects
Kurator: Agile Development
Q & A …
• What do data curation and quality control mean in your domain / application / research?
• Are there particular issues that are important to you?
• Join us!
  – Kurator & other Biodiversity Interest
    • Hackers welcome, too.
  – Email: ludaesch@illinois.edu
Related Research (Tianhong Song)
• Automated Design, Analysis, Optimization of Curation Workflows.
• Idea:
• Example Workflow: [Scientific Name Validation] [GeoRef Validation] [Date Validation]
Related Research (Tianhong Song)
• Analyze linear workflow “story”
• Use patterns to discover workflow design issues (e.g. use before update); then fix them
• Parallelize when possible
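The slides do not spell out how the analysis works, so the following is only one plausible reading, not the method from the related research. It assumes each step of a linear workflow declares which record fields it reads and which it updates. A "use before update" issue is then a step that only reads a field which a later step updates, and steps with no read/update conflict can share a parallel stage. `Step`, `use_before_update`, and `parallel_stages` are hypothetical names.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Tuple


class Step(NamedTuple):
    name: str
    reads: FrozenSet[str]
    updates: FrozenSet[str]


def use_before_update(steps: List[Step]) -> List[Tuple[str, str, List[str]]]:
    """Find steps that consume a field before a later step updates it."""
    issues = []
    for i, early in enumerate(steps):
        for late in steps[i + 1:]:
            # Fields the early step only reads, but the late step changes.
            shared = (early.reads - early.updates) & late.updates
            if shared:
                issues.append((early.name, late.name, sorted(shared)))
    return issues


def parallel_stages(steps: List[Step]) -> List[List[str]]:
    """Group steps into stages; steps in one stage do not conflict."""
    stage_of: Dict[str, int] = {}
    for i, step in enumerate(steps):
        level = 0
        for earlier in steps[:i]:
            conflict = (earlier.updates & (step.reads | step.updates)) or (
                earlier.reads & step.updates
            )
            if conflict:
                # Must run after the conflicting earlier step.
                level = max(level, stage_of[earlier.name] + 1)
        stage_of[step.name] = level
    stages: List[List[str]] = [
        [] for _ in range(max(stage_of.values(), default=-1) + 1)
    ]
    for step in steps:
        stages[stage_of[step.name]].append(step.name)
    return stages
```

For the three validators of the example workflow, each touching only its own fields, this sketch reports no issues and places all three steps in a single stage. Putting a read-only reporting step on `scientificName` ahead of the name validation would be reported as a use-before-update issue.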