Data Wrangling II: Programming on the Whiteboard
February 26, 2016
Paige Morgan
Digital Humanities Librarian
Starting Activity: Open Syllabus Project
http://opensyllabusproject.org/
Open Syllabus Project
•Use the syllabus explorer to examine the data.
•Keep track of each step you take as you drill down.
•Goal: develop a research question based on your explorations.
•What other data would you need to answer this research question?
Last week...
•The work of creating usable data
•Forms that this data might take:
• Markup language
• Spreadsheets and relational databases (e.g., MySQL)
• Non-relational databases (RDF/Linked Open Data)
This week:
•Caveat Curator (challenges of working with data)
•Programming on the Whiteboard, i.e., conceptualizing the specific steps that you need to take to accomplish your goals
Goals/Takeaways
•A better understanding of the workflow for dealing with data
•How to start small and scale up effectively
•Greater ability to talk about what you’re trying to do
Why this focus on data?
•Understanding your data, and your intended actions, is a key skill for developing any digital project (big or small).
•You may have one big project – but your data may support several small/intermediary projects.
Image: Josh Lee, @wtrsld, via Twitter, January 2014.
What if your data is crowdsourced?
You can require a particular format for submissions.
You can even put programmatic limits on the formats available for submission.
But in the end, you’re probably still going to need to scrub and/or format.
This is true even for data from supposedly reputable sources, like government or media organizations.
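As a rough illustration of the two ideas above, here is a minimal Python sketch of a programmatic limit on submission format, followed by a light scrubbing step. The field names (title, year), the four-digit-year rule, and the sample records are all assumptions made up for this example; a real project would define its own rules.

```python
import re

# Hypothetical rule: a year must be exactly four digits.
YEAR_PATTERN = re.compile(r"^\d{4}$")


def scrub(record):
    """Trim stray whitespace and collapse repeated spaces in every field."""
    return {key: " ".join(value.split()) for key, value in record.items()}


def is_valid(record):
    """Accept a record only if it has a non-empty title and a four-digit year."""
    has_title = bool(record.get("title"))
    has_year = bool(YEAR_PATTERN.match(record.get("year", "")))
    return has_title and has_year


# Made-up sample submissions, as a crowd might enter them.
submissions = [
    {"title": "  Middlemarch ", "year": "1871"},
    {"title": "Bleak   House", "year": "c. 1853"},  # year breaks the format rule
]

cleaned = [scrub(s) for s in submissions]
accepted = [s for s in cleaned if is_valid(s)]
rejected = [s for s in cleaned if not is_valid(s)]
```

Even with the format rule in place, the rejected record still holds usable information, which is why some hand cleaning is usually left over.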
Example: Doctor Who Villains dataset
http://tinyurl.com/doctorwhovillains
Data Dictionaries
If you are thinking about your data, and the tasks that you need to accomplish, then it’s easier to determine what sort of language or platform your project needs.
Pseudocode
•Used by programmers to break down a complex task into single steps
•Easily adaptable for use by non-programmers
Pseudocode Example (Visible Prices)
• Computer has a file that contains prices from different texts.
• Computer must know that each price amount is connected with an object, and with a bibliographical record.
• Users can input a price amount, and computer will retrieve all objects that match the price, and display them to the user, along with bibliographical information.
• (More complex): Computer is able to retrieve prices linked with certain categories (clothing, food, etc.)
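The steps above could be turned into working code in many ways. The following is one possible Python sketch, not how Visible Prices itself is built. The class name PriceRecord, its fields, the function names, and the sample records are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class PriceRecord:
    amount: str    # price as written in the text, e.g. "5s"
    obj: str       # the object the price is attached to
    category: str  # broad category, e.g. "clothing" or "food"
    source: str    # bibliographical record for the text


# Steps 1 and 2: a collection of prices, each tied to an object and a source.
# These records are invented placeholders, not real Visible Prices data.
RECORDS = [
    PriceRecord("5s", "pair of gloves", "clothing", "Example Novel A (1800)"),
    PriceRecord("5s", "dinner at an inn", "food", "Example Novel B (1810)"),
    PriceRecord("2d", "loaf of bread", "food", "Example Novel A (1800)"),
]


def find_by_price(amount, records=RECORDS):
    """Step 3: return every record whose price matches the user's input."""
    return [r for r in records if r.amount == amount]


def find_by_category(category, records=RECORDS):
    """Step 4 (more complex): return every record in a given category."""
    return [r for r in records if r.category == category]


def display(records):
    """Format each match with its object and bibliographical information."""
    return [f"{r.amount}: {r.obj} ({r.source})" for r in records]


for line in display(find_by_price("5s")):
    print(line)
```

Each function maps to one line of the pseudocode, which is the point of the exercise: once the steps are written down plainly, the code mostly follows them.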
It is likely that your data will have a longer life span than any specific project you create.
In many instances, it may be just as useful to focus on curating the data as on any single project.
Getting Data
•Figshare
•Datahub.io
•Project websites
•APIs
Cleaning Data
•OpenRefine http://openrefine.org/
Key DH Values
•Adaptive
•Sustainable/resource-aware
•Collaborative
•Social
Key skills
•Thinking flexibly about your data (and potential project)
•Are there portions of your dataset that could be extracted for use in a particular tool?
•How can you adjust your data in order to show it to people (and be better able to talk/write/present about your research interests)?
And now, it’s your turn...
Group Activity
•What questions can you ask and answer with this data as it is?
•What data would you need in order to ask & answer other research questions?
•What are the steps that you would need to take in order to answer those research questions?
Next steps
•What’s the smallest version of your dataset possible? (useful for testing out tools)
•Possible tools to examine (as ways of presenting your data)
• Omeka (http://www.omeka.net)
• Scalar (http://scalar.usc.edu)
• Simile (http://www.simile-widgets.org)
• Google Fusion Tables (https://support.google.com/fusiontables/answer/2571232)
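For the first of these next steps, one way to get a small test version of a dataset is to pull a handful of rows from the full file. The Python sketch below assumes the data is CSV with a header row. The function name sample_rows and the sample data are made up for illustration.

```python
import csv
import io
import random


def sample_rows(csv_text, n, seed=0):
    """Return the header plus up to n randomly chosen data rows from CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    rng = random.Random(seed)  # fixed seed so the same sample comes back each run
    return [header] + rng.sample(data, min(n, len(data)))


# Invented placeholder data standing in for a much larger file.
full_dataset = "id,item\n1,apple\n2,bread\n3,candle\n4,gloves\n5,ink\n"
small = sample_rows(full_dataset, 2)
```

A small sample like this is quick to load into a new tool, so you can tell early whether the tool suits your data before committing the whole dataset.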
Thank you!
•Questions? Ideas? Book a consult at http://paigecmorgan.youcanbook.me