A presentation given at the "Data Stewardship: Increasing the Integrity and Effectiveness of Science and Scholarship" session on Friday, June 8, 2012 at the IASSIST 2012 conference in Washington, DC. This presentation introduced data publishing, using a social science (archaeology) case study to explore editorial processes and dissemination outcomes that increasingly demand “Linked Data” capabilities.
Case-Study: Publishing to the “Web of Data” in Archaeology
Quality and Workflows
Eric Kansa UC Berkeley / OpenContext.org
Unless otherwise indicated, this work is licensed under a Creative Commons Attribution 3.0 License <http://creativecommons.org/licenses/by/3.0/>
“Small Science” data sharing is hard:
(1) Complexity
(2) Scalability
(3) Ethics, cultural property claims, IP
(4) Incentives
(5) Preservation
Image Credit: “Grand Canyon NPS” via Flickr (CC-By) http://www.flickr.com/photos/grand_canyon_nps/5975537378/
Thousand Flowers
● Open Context: Open access, open licensed data for archaeology
● Archiving by California Digital Library
● Persistent Identifiers (DOIs, ARKs)
● Web services
● NSF/NEH links for data management plans
Thousand Flowers
Fills a Gap:
Most data sources are institutional. Open Context publishes individual, small group contributions
Challenge:
Diverse contributions, needing lots of work to clean up and “link” to the Web of Data
• 3-year project Oct 2010 – Sep 2013
• Funded with a National Leadership Grant from the Institute of Museum and Library Services, LG-06-10-0140-10, “Dissemination Information Packages for Information Reuse”
• Ixchel Faniel, PI & Elizabeth Yakel, Co-PI
http://www.dipir.org
DIPIR Collaboration
The Big DIPIR Questions
Research Questions
1. What are the significant properties of data that facilitate reuse by the designated communities at the three sites?
2. How can these significant properties be expressed as representation information to ensure the preservation of meaning and enable data reuse?
Open Context Interviewees
• 22 Ph.D. or graduate students interviewed
– 13 men
– 9 women
• Novices / Experts
– 19 experts
– 3 novices
• Interviewees who were curators, or professors who also had a curatorial role = 6
Raw Data is Unappetizing?
Data Documentation Practices
“I use an Excel spreadsheet…which I … inherited from my research advisers. …my dissertation advisor was still recording data for each specimen on paper when I was in graduate school so that's what I started …then quickly, I was like, "This is ridiculous." … I just started using an Excel spreadsheet that has sort of slowly gotten bigger and bigger over time with more variables or columns…I've added …color coding…I also use…a very sort of primitive numerical coding system, again, that I inherited from my research advisers…So, this little book that goes with me of codes which is sort of odd, but …we all know that a 14 is a sheep.” (CCU13)
A long way to go before we get usable, intelligible data
Sometimes data is better served cooked.
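The coded spreadsheet described in the interview can be made intelligible by turning the "little book" of codes into an explicit lookup table. The following is a minimal sketch, assuming a hypothetical code book: only "14 = sheep" comes from the quote, and the other entries, the function name, and the sample rows are invented for illustration.

```python
# Hypothetical code book: only "14 = sheep" comes from the interview quote;
# every other entry is invented for illustration.
CODE_BOOK = {
    14: "sheep",
    15: "goat",    # assumed
    16: "cattle",  # assumed
}


def decode_taxon(raw_code, code_book=CODE_BOOK):
    """Return (label, problem); problem is None when the code decodes cleanly."""
    try:
        code = int(str(raw_code).strip())
    except ValueError:
        return None, f"not a number: {raw_code!r}"
    label = code_book.get(code)
    if label is None:
        return None, f"code {code} is missing from the code book"
    return label, None


# Sample cell values as they might appear in a contributed spreadsheet.
rows = ["14", " 14 ", "99", "sheep?"]
decoded = [decode_taxon(value) for value in rows]
```

Values that fail to decode are reported rather than dropped, so an editor can raise them with the contributor.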
Thousand Flowers
● Clean up and document contributed data
● Map to ArchaeoML (general ontology)
● Mint URIs for entities (potsherds, projects, contexts, people)
● Link to important vocabularies / collections (Pleiades, Encyclopedia of Life)
● Working on CIDOC-CRM (RDF) representations (not straightforward)
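Minting URIs and linking them to outside vocabularies can be sketched in a few lines. This is an illustration only, not Open Context's actual scheme: the base URI, the project name, the local ID, and the target concept URI are all assumptions, and a deterministic UUID is just one possible way to keep minted URIs stable.

```python
import uuid

# Assumed base URI for illustration; not Open Context's actual URI scheme.
BASE_URI = "https://example.org/subjects/"


def mint_uri(project, local_id):
    """Derive a stable URI from a project name and a contributor's local ID."""
    key = f"{project.strip().lower()}/{local_id.strip()}"
    # uuid5 is deterministic: the same key always yields the same UUID.
    return BASE_URI + str(uuid.uuid5(uuid.NAMESPACE_URL, key))


def link(subject_uri, predicate_uri, object_uri):
    """Represent one link as a (subject, predicate, object) triple."""
    return (subject_uri, predicate_uri, object_uri)


# Hypothetical potsherd record linked to a placeholder vocabulary concept.
sherd = mint_uri("Hypothetical Survey", "sherd-0042")
triple = link(
    sherd,
    "http://www.w3.org/2004/02/skos/core#closeMatch",
    "https://example.org/vocab/placeholder-concept",  # stand-in for a Pleiades or EOL URI
)
```

Because the URI is derived from the project and local ID, re-importing the same dataset yields the same URIs instead of new ones.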
Open Context: Record
● XHTML + RDFa (Dublin Core, Open Annotation, etc.)
● XML (ArchaeoML)
● Atom
● RDF (draft CIDOC)
● Link to GitHub versioned file
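One common way to offer several representations of the same record is HTTP content negotiation. The slides do not say how Open Context selects a format, so the sketch below is an assumption: the media types are standard, but the file suffixes, the default, and the function name are hypothetical, and q-values are deliberately ignored.

```python
# Standard media types for the representations listed above.
# The file suffixes are hypothetical.
REPRESENTATIONS = {
    "application/xhtml+xml": ".html",  # XHTML + RDFa
    "application/xml": ".xml",         # ArchaeoML
    "application/atom+xml": ".atom",   # Atom
    "application/rdf+xml": ".rdf",     # RDF (draft CIDOC)
}
DEFAULT_TYPE = "application/xhtml+xml"  # assumed default


def choose_representation(accept_header):
    """Pick the first supported media type named in an Accept header.

    Simplified: q-values are ignored and order in the header decides.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in REPRESENTATIONS:
            return media_type
    return DEFAULT_TYPE
```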
Open Context: Visualization of Data Linked to the EOL
My Precious Data
Image Credit: “Lord of the Rings” (2003, New Line), All Rights Reserved Copyright
Data sharing as publication
Data Publishing
Data Quality and Standards Alignment
(1) Check consistency
(2) Edit functions
(3) Align to common standards (“Linked Data” if applicable)
(4) Issue tracking, version control
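Step (1), checking consistency, might begin with a per-column summary like the one below. This is a minimal sketch, not the project's actual checks: the function name, the three categories, and the sample values are assumptions.

```python
from collections import Counter


def check_column(values):
    """Summarize simple consistency problems in one column of a table."""
    kinds = Counter()
    for value in values:
        text = "" if value is None else str(value).strip()
        if text == "":
            kinds["blank"] += 1
            continue
        try:
            float(text)
            kinds["number"] += 1
        except ValueError:
            kinds["text"] += 1
    # A column mixing numbers and free text usually needs editorial attention.
    mixed = kinds["number"] > 0 and kinds["text"] > 0
    return {"counts": dict(kinds), "mixed_types": mixed}


# Hypothetical measurement column with a blank, a missing value, and a note.
report = check_column(["12.5", "13", "", "approx. 14", None])
```

A flagged column is a prompt for an editor, not an automatic correction.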
Publishing
Tools of the Trade
(1) Google Refine (check, edit, consistency)
(2) Mantis (issue-tracker, coordinate edits, metadata creation)
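Google Refine offers key-collision clustering to find values that differ only in case, punctuation, or word order. The sketch below works in that spirit but is a simplified stand-in, not Refine's implementation: here punctuation is replaced with spaces, and the sample values are invented.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so that trivial variants share one key."""
    # Lowercase, turn punctuation into spaces, then sort the unique tokens.
    text = re.sub(r"[^\w\s]", " ", value.strip().lower())
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct values whose fingerprints collide."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical spellings of the same category from a contributed dataset.
clusters = cluster(["Sheep/Goat", "sheep goat", "Goat, sheep", "Cattle"])
```

Each cluster is a candidate for merging into one preferred spelling, subject to an editor's approval.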
Publishing
Tools of the Trade
(1) Domain scientists (Editorial Board) check data
(2) Iterative “coproduction” between contributors and editors
Publishing
Project Metadata
Column Descriptions
Web of Data (2011)
Main Contributors:
● Institutions (esp. government)
● Thematic collections / projects
Entity Reconciliation
(1) With Google Refine
(2) Implemented for EOL and Pleiades (gazetteer)
(3) Use existing mappings to improve future reconciliation
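The reconciliation loop, including reuse of earlier mappings as in step (3), can be sketched with the standard library. This is not how Google Refine reconciles: the vocabulary labels and URIs are placeholders rather than real EOL or Pleiades entries, and the similarity cutoff is an assumed value.

```python
import difflib

# Hypothetical vocabulary: labels and URIs are placeholders,
# not real EOL or Pleiades entries.
VOCABULARY = {
    "ovis aries": "https://example.org/vocab/ovis-aries",
    "capra hircus": "https://example.org/vocab/capra-hircus",
    "bos taurus": "https://example.org/vocab/bos-taurus",
}


def reconcile(term, known_mappings, vocabulary=VOCABULARY, cutoff=0.8):
    """Return a vocabulary URI for term, or None if no confident match exists."""
    key = term.strip().lower()
    if key in known_mappings:  # reuse an earlier editorial decision
        return known_mappings[key]
    matches = difflib.get_close_matches(key, list(vocabulary), n=1, cutoff=cutoff)
    if not matches:
        return None  # leave for an editor to resolve
    uri = vocabulary[matches[0]]
    known_mappings[key] = uri  # remember the match for future datasets
    return uri


# An existing mapping from a previous dataset (assumed).
mappings = {"sheep": VOCABULARY["ovis aries"]}
```

Terms that return `None` stay unlinked until an editor decides, which keeps weak fuzzy matches out of the published data.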
Publishing
● CDL Archiving Service
● EZID for persistent identity: DOIs (aggregate resources), ARKs (granular resources), and Merritt Repository
● Helps build trust in community
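Since DOIs are used for aggregate resources and ARKs for granular ones, software that handles both needs to tell them apart. The sketch below assumes simplified patterns, and real DOI and ARK syntax allows more variation. The identifiers in the usage note are made up for illustration.

```python
import re

# Simplified patterns; real DOI and ARK syntax allows more variation.
DOI_PATTERN = re.compile(r"^doi:(10\.\d{4,9})/(\S+)$", re.IGNORECASE)
ARK_PATTERN = re.compile(r"^ark:/?(\d{5,9})/(\S+)$", re.IGNORECASE)


def parse_identifier(identifier):
    """Split an identifier into (scheme, naming authority, name), or return None."""
    text = identifier.strip()
    for scheme, pattern in (("doi", DOI_PATTERN), ("ark", ARK_PATTERN)):
        match = pattern.match(text)
        if match:
            return scheme, match.group(1), match.group(2)
    return None
```

For example, `parse_identifier("ark:/12345/x9abc/part1")` separates the naming authority number from the name, while a plain web URL yields `None`.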
● Platform / Services disciplinary communities can use for “Data Publishing”
● Different communities work out semantic/interoperability needs, editorial policies, incentives, etc.
University of California (System) Repository, all disciplines (UC-funded library, grants)
CDL as Infrastructure
Future data publisher
eScholarship: UC’s OA Publishing Platform
Platform for traditional publishing
Also supports new genres
Outcomes of Publishing Data:
(1) Communicate and set expectations about content and quality
(2) Organize workflows to improve data quality and usability
(3) Make “datasets” first class citizens in the world of scholarly communications
Summary
Final Thoughts
Publication needs to evolve!
(1) Participating in Linked Data is a great goal, but far removed from most everyday practice
(2) Researchers need help.
(3) 19th-century publication norms are poorly suited to 21st-century methods, research, and public goals