Text Analysis Methods for Digital Humanities Helen Bailey and Sands Fish MIT Libraries


DESCRIPTION

Slides from text analysis methods for digital humanities workshop, taught by Helen Bailey and Sands Fish for MIT course CMS.633: Digital Humanities in February, 2014.


Page 1: Text Analysis Methods for Digital Humanities

Text Analysis Methods for Digital Humanities

Helen Bailey and Sands Fish MIT Libraries

Page 2: Text Analysis Methods for Digital Humanities

Examples of Data Narratives

•  Visualizing Emancipation

•  Narrative Visualization of Whaling Ship Logs

•  Out of Sight, Out of Mind

Page 3: Text Analysis Methods for Digital Humanities

Approaches to Storytelling w/ Data

•  EDA - Exploratory Data Analysis
•  Exploring data from a number of perspectives:
    o  Temporal
    o  Geographical
    o  Statistical
    o  Categorical
    o  Relational
•  80% - Data Hacking, 20% - Narrative Construction, Visualization, etc.

Page 4: Text Analysis Methods for Digital Humanities

"To use any sort of historical data, we must above all understand the constraints under which it was collected. In this case, that means retelling the history of why and how the ship's logs were first collected, and how the constraints of digitization in the punch card era radically shape the sort of evidence we can draw from them. The important thing about this sort of work is that it helps us understand the overall biases of a particular data set, which is crucial for limiting our interpretive leaps."

- Ben Schmidt, “Reading digital sources: a case study in ship's logs”

Page 5: Text Analysis Methods for Digital Humanities

Inherent Biases & Limitations

•  Data capture methods and format
•  Purpose of data collection
•  Transformation over time
•  Authenticity and trust

Understand provenance

Page 6: Text Analysis Methods for Digital Humanities

“Rather than replace humans, computers amplify human abilities. The most productive line of inquiry, therefore, is not in identifying how automated methods can obviate the need for researchers to read their text. Rather, the most productive line of inquiry is to identify the best way to use both humans and automated methods for analyzing texts.”

- Justin Grimmer and Brandon M. Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”

Page 7: Text Analysis Methods for Digital Humanities

Acquiring Text

•  Full-text resources:
    o  DSpace@MIT http://dspace.mit.edu/
    o  Dome http://dome.mit.edu/
    o  Digital Public Library of America http://dp.la
    o  Europeana http://www.europeana.eu/portal/
    o  HathiTrust http://www.hathitrust.org/
•  http://libguides.mit.edu/apis - metadata only
•  http://libguides.mit.edu/digitalhumanities

Page 8: Text Analysis Methods for Digital Humanities

Data Management and Sharing

•  Assumption of sharing and data management plan as a funding requirement
•  Data storage options - anticipate interaction
    o  Storage formats - non-proprietary and repurposable whenever possible
    o  File system storage vs. database
•  Documentation of process

http://libraries.mit.edu/guides/subjects/data-management/

Page 9: Text Analysis Methods for Digital Humanities

Formatting / Pre-Processing

•  Tool input requirements
•  Assumptions:
    o  Text as a “bag of words”
    o  Unigrams, bigrams
    o  Word order (or not)
    o  Stop words, capitalization, punctuation
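The pre-processing assumptions above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow of any particular tool: the stop word list, the tokenizing regular expression, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; real projects use a longer one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "was"}

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n):
    """Return the n-grams (as tuples) in token order: n=1 unigrams, n=2 bigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """Count each token, discarding word order entirely."""
    return Counter(tokens)

text = "The ship sailed to the whaling grounds, and the ship returned."
tokens = remove_stop_words(tokenize(text))
bigrams = ngrams(tokens, 2)
bag = bag_of_words(tokens)
```

The bag of words ignores order, while the bigrams keep a little of it; which one fits depends on what the downstream tool expects as input.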

Page 10: Text Analysis Methods for Digital Humanities

Featurizing Text

•  Each word becomes a feature, or "dimension"
•  With one dimension per distinct word, this is called "high dimensional" data
•  Each document is represented as a vector in Euclidean space, with one component per feature
•  Euclidean geometry (distance, angle) generalizes beyond 3 dimensions
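As a rough sketch of what featurizing means in practice, the Python below turns three short documents into word-count vectors and compares them by Euclidean distance. The documents, the whitespace tokenization, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct word to a dimension index (sorted for a stable order)."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    return {word: i for i, word in enumerate(vocab)}

def vectorize(doc, vocab):
    """Represent a document as a vector of word counts, one slot per dimension."""
    counts = Counter(doc.split())
    vec = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:
            vec[vocab[word]] = count
    return vec

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy corpus: six distinct words give a six-dimensional space.
docs = ["whale ship sea", "whale ship ship", "map city street"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]

near = euclidean_distance(vectors[0], vectors[1])  # two documents sharing words
far = euclidean_distance(vectors[0], vectors[2])   # documents with no shared words
```

The same distance formula works whether the vocabulary has six words or sixty thousand, which is the sense in which the geometry extends past three dimensions.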

Page 11: Text Analysis Methods for Digital Humanities

The Shape of Data

•  Data structures and formats
•  Informed (in part) by:
    o  Tools
    o  Co-occurrence
    o  Data output formats
    o  Entity type
    o  Temporal, geographical perspective, etc.
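One way to see how co-occurrence shapes data: the hypothetical sketch below starts from per-document entity lists and reshapes them into pair counts, then into flat rows suitable for a CSV or a network tool. The sample records and function names are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each document with the entities mentioned in it.
documents = {
    "log_1": ["Nantucket", "Pacific", "whale"],
    "log_2": ["Nantucket", "whale"],
    "log_3": ["Pacific", "whale"],
}

def cooccurrence_counts(docs):
    """Count how many documents mention each unordered pair of entities."""
    counts = Counter()
    for entities in docs.values():
        # sorted(set(...)) removes repeats and gives each pair one canonical order.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

def to_edge_rows(counts):
    """Flatten the pair counts into (source, target, weight) rows."""
    return [(a, b, n) for (a, b), n in sorted(counts.items())]

counts = cooccurrence_counts(documents)
rows = to_edge_rows(counts)
```

The nested dictionary suits per-document questions; the flat rows suit tools that expect an edge list.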

Page 12: Text Analysis Methods for Digital Humanities

Validation

From Ben Schmidt’s “Machine Learning at Sea”

Page 13: Text Analysis Methods for Digital Humanities

Network Models

•  Representing data as a network
    o  Types: technological, communication, transportation, energy, airplane routes, web linking patterns
    o  social
        §  non-human animal interaction
        §  membership in larger groups
        §  sexually transmitted diseases
        §  co-authorship of scientific publications
        §  trade agreements between nations
•  Mapping the News - Berkman's Controversy Work
    o  Spidering
    o  Influential actors over time
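A co-authorship network, one of the social examples above, can be sketched with plain Python data structures. The paper list is invented, and ranking nodes by number of connections (degree) is only a simple stand-in for "influence"; it is not the method used in the work cited on the slide.

```python
from collections import defaultdict

# Hypothetical co-authorship data: each paper lists its authors.
papers = [
    ["Ada", "Ben"],
    ["Ada", "Cy"],
    ["Ada", "Ben", "Dee"],
]

def build_network(papers):
    """Build an undirected graph: authors are nodes, shared papers create edges."""
    graph = defaultdict(set)
    for authors in papers:
        for a in authors:
            for b in authors:
                if a != b:
                    graph[a].add(b)
    return graph

def degree_ranking(graph):
    """Rank nodes by number of distinct neighbors, breaking ties alphabetically."""
    return sorted(graph, key=lambda node: (-len(graph[node]), node))

graph = build_network(papers)
ranking = degree_ranking(graph)
```

Running the same ranking on snapshots of the data from different periods would give a crude view of how prominent actors change over time.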

Page 14: Text Analysis Methods for Digital Humanities

Topic Modeling Tools

•  MALLET
    o  Can run on unstructured plain text files
    o  http://mallet.cs.umass.edu/topics.php
•  Stanford Topic Modeling Toolbox
    o  Requires data in a CSV or TSV file
    o  http://nlp.stanford.edu/software/tmt/tmt-0.4/
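Because one tool reads plain text and the other expects CSV or TSV, a small conversion step is often needed. The sketch below writes documents as tab-separated rows using Python's standard csv module. The two-column layout (identifier, then text) and the sample documents are assumptions; check each tool's documentation for the columns it actually expects.

```python
import csv
import io

def write_tsv(docs, handle):
    """Write one row per document: an identifier column and a text column."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for doc_id, text in docs.items():
        # Collapse tabs and line breaks so each document stays on one row.
        writer.writerow([doc_id, " ".join(text.split())])

# Hypothetical documents; in practice these would be read from text files.
docs = {
    "doc1": "Call me\nIshmael.",
    "doc2": "It was the\tbest of times.",
}

buffer = io.StringIO()  # stands in for an open file in this example
write_tsv(docs, buffer)
tsv_text = buffer.getvalue()
```

Passing a file opened with `open(path, "w", newline="", encoding="utf-8")` instead of the in-memory buffer would write the same rows to disk.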

Page 15: Text Analysis Methods for Digital Humanities

Entity Extraction

•  Identifies known entities in specific categories
    o  Locations
    o  People
    o  Organizations
    o  Dates/times
•  Creates annotated text from unstructured text
•  Domain-specific
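To make "annotated text from unstructured text" concrete, here is a toy lookup-based tagger in Python. It is a sketch only and does not reflect how any particular entity extraction tool works internally. The entity list, category labels, and tag format are assumptions made for the example.

```python
import re

# Hypothetical lookup table mapping known entity names to categories.
KNOWN_ENTITIES = {
    "Nantucket": "LOCATION",
    "Herman Melville": "PERSON",
    "MIT Libraries": "ORGANIZATION",
}

def _pattern(entities):
    # Try longer names first so multi-word names win over shorter overlaps.
    names = sorted(entities, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

def extract(text, entities=KNOWN_ENTITIES):
    """Return (name, category) pairs in the order they appear in the text."""
    return [(m.group(1), entities[m.group(1)]) for m in _pattern(entities).finditer(text)]

def annotate(text, entities=KNOWN_ENTITIES):
    """Wrap each known entity in a tag named after its category."""
    def tag(match):
        name = match.group(1)
        category = entities[name]
        return f"<{category}>{name}</{category}>"
    return _pattern(entities).sub(tag, text)

sentence = "Herman Melville sailed from Nantucket."
found = extract(sentence)
annotated = annotate(sentence)
```

The table also shows why extraction is domain-specific: a name missing from the table, or a category that does not fit the material, is simply never found.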

Page 16: Text Analysis Methods for Digital Humanities

Entity Extraction Tools

•  Stanford Named Entity Recognizer http://nlp.stanford.edu/software/CRF-NER.shtml
•  Illinois Named Entity Tagger http://cogcomp.cs.illinois.edu/page/download_view/NETagger
•  DBpedia Spotlight https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki

Page 17: Text Analysis Methods for Digital Humanities

Geo-Parsing

•  Common Pitfalls
    o  Set of places (GeoNames dictionary)
    o  Dictionary determines how broad or narrow your search is
•  Enhancements to CLAVIN by Civic Media
    o  Aboutness (uses mention counting)
    o  HTTP access used for more advanced workflows
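The two ideas on this slide, a place dictionary and "aboutness" by mention counting, can be illustrated with a toy Python sketch. This is not CLAVIN's algorithm or API. The three-entry dictionary, its approximate coordinates, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical mini place dictionary: name -> approximate (latitude, longitude).
PLACES = {
    "Boston": (42.36, -71.06),
    "Cambridge": (42.37, -71.11),
    "Paris": (48.86, 2.35),
}

def find_place_mentions(text, places=PLACES):
    """Return every word in the text that matches a name in the dictionary."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w in places]

def aboutness(text, places=PLACES):
    """Guess the place a text is about: the most frequently mentioned one."""
    counts = Counter(find_place_mentions(text, places))
    if not counts:
        return None
    place, _ = counts.most_common(1)[0]
    return place, places[place]

text = "Boston and Cambridge share a river. Boston is larger; Paris is far away."
mentions = find_place_mentions(text)
focus = aboutness(text)
```

The dictionary pitfall is visible even here: "Cambridge" resolves to whichever Cambridge the dictionary lists, and any place the dictionary omits is never found.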