
Workshop Exercise: Text Analysis Methods for Digital Humanities


Helen Bailey and Sands Fish, MIT Libraries


MALLET

Pre-workshop, students should download and install the MALLET GUI on their laptops: https://code.google.com/p/topic-modeling-tool/. They should also run the topic modeler on a sample text file with the default settings to make sure it’s working correctly.
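For anyone who prefers the command line to the GUI, the same sanity check can be run against the full MALLET distribution. The sketch below is illustrative only: the mallet-2.0.8/bin/mallet path and the file names are placeholders for your own install (on Windows the launcher is bin\mallet.bat), not part of the workshop materials.

import subprocess

MALLET = "mallet-2.0.8/bin/mallet"  # placeholder path to the mallet launcher script

# Import the sample text file into MALLET's binary instance format.
subprocess.run([MALLET, "import-file",
                "--input", "sample.txt",
                "--output", "sample.mallet",
                "--keep-sequence",
                "--remove-stopwords"], check=True)

# Train a small topic model and write out the topic keys and the
# per-document topic composition.
subprocess.run([MALLET, "train-topics",
                "--input", "sample.mallet",
                "--num-topics", "10",
                "--output-topic-keys", "topic_keys.txt",
                "--output-doc-topics", "doc_topics.txt"], check=True)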

Helpful MALLET Resources

• MALLET GUI information
• Blog post on using the GUI and displaying output in Gephi
• Using MALLET on the command line
• Intro to Topic Modeling in general
• MALLET website
• Review of MALLET in the Journal of Digital Humanities
• Using HO-LDA / Finding Number of Topics in Emergency Text Classification

In-Class Exercise

1. Run MALLET on a known corpus (full-text examples used in the demo, all from Project Gutenberg: Adventures of Huckleberry Finn, Alice’s Adventures in Wonderland, Andersen’s Fairy Tales, Grimm’s Fairy Tales, Life on the Mississippi, On the Origin of Species, The Wizard of Oz).

2. Change the parameters to see how they impact the results (a command-line sketch of these parameters follows this exercise list). For example:
• Does preserving case matter?
• How does changing the number of iterations impact the results?
• What about changing the topic proportion threshold?
• How many topic words should you print? (What are you trying to discover? How much info is useful?)
• What do the results tell you about this corpus? How could you use this to learn about a corpus you weren’t familiar with?

3. MALLET implements the LDA (latent Dirichlet allocation) algorithm; discuss its details a little. Mention hierarchical topic modeling as a point of contrast (not available via MALLET).
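The GUI parameters in exercise 2 map directly onto options of MALLET’s command-line tools, which can make it easier to vary one setting at a time and keep the outputs side by side. A sketch, assuming the full MALLET distribution and a local folder holding the seven Gutenberg texts (the paths, folder name, and values here are placeholders to adjust):

import subprocess

MALLET = "mallet-2.0.8/bin/mallet"  # placeholder path to the mallet launcher script

# Re-import the corpus with case preserved (the GUI's "preserve case" option).
subprocess.run([MALLET, "import-dir",
                "--input", "gutenberg_texts/",      # folder containing the seven texts
                "--output", "corpus.mallet",
                "--keep-sequence",
                "--remove-stopwords",
                "--preserve-case", "TRUE"], check=True)

# Retrain, varying the other parameters named above.
subprocess.run([MALLET, "train-topics",
                "--input", "corpus.mallet",
                "--num-topics", "10",
                "--num-iterations", "2000",         # number of iterations
                "--num-top-words", "30",            # how many topic words to print
                "--doc-topics-threshold", "0.05",   # topic proportion threshold
                "--output-topic-keys", "topic_keys.txt",
                "--output-doc-topics", "doc_topics.txt"], check=True)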


Stanford Named Entity Recognizer

Pre-workshop, students should download and install the SNER GUI on their laptops: http://nlp.stanford.edu/software/CRF-NER.shtml#Download. They should also run SNER on a sample file using the default classifier (or, if that’s not available, the first classifier in the classifier folder) to make sure it’s working correctly.
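As with MALLET, the GUI run can be reproduced from the command line for a quick sanity check. The sketch below assumes the standard file names in the Stanford NER download (stanford-ner.jar, the classifiers/ folder, english.all.3class.distsim.crf.ser.gz); adjust the paths for your version, and on Windows use ";" instead of ":" as the classpath separator.

import subprocess

# Tag a sample file with the default 3-class English classifier and print
# the result as token/TAG pairs.
subprocess.run(["java", "-mx600m",
                "-cp", "stanford-ner.jar:lib/*",
                "edu.stanford.nlp.ie.crf.CRFClassifier",
                "-loadClassifier", "classifiers/english.all.3class.distsim.crf.ser.gz",
                "-textFile", "sample.txt",
                "-outputFormat", "slashTags"], check=True)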

Helpful SNER Resources

• Basic GUI tutorial

In-Class Exercise

• Run SNER on a known corpus. Change the classifier to see if the results differ.
• Save the tagged file output and open it. What do you then need to do with that to make it useful? (See the parsing sketch after this list.)
• Discuss the difference between entity extraction and entity disambiguation.
• Do we want to have them run the output through a concordance program?
• What might you do with this data? How could it interact with other tools to tell the narrative?
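One concrete answer to the “what do you need to do with the tagged output” question is to collapse the per-token tags into entities and count them. This is a minimal sketch assuming slash-tag output (token/TAG token/TAG ...) saved to a hypothetical sample_tagged.txt; if your saved output uses inline XML tags instead, the parsing step would need to change accordingly.

from collections import Counter

entities = Counter()
with open("sample_tagged.txt", encoding="utf-8") as f:
    current_words, current_tag = [], "O"
    for token in f.read().split():
        word, _, tag = token.rpartition("/")     # split "Twain/PERSON" into word and tag
        if tag != current_tag:                   # tag changed: close out the previous entity
            if current_tag not in ("O", ""):
                entities[(" ".join(current_words), current_tag)] += 1
            current_words, current_tag = [], tag
        current_words.append(word)
    if current_tag not in ("O", ""):             # flush the final entity, if any
        entities[(" ".join(current_words), current_tag)] += 1

# Print the most frequent entities with their types.
for (name, tag), count in entities.most_common(20):
    print(f"{count:4d}  {tag:10s}  {name}")

Note that this simple collapse will merge two different entities of the same type when they appear back to back; a concordance view of the original text can help sort out such cases.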

CLAVIN

• CLAVIN Tool by Berico Technologies
  o Cartographic Location And Vicinity INdexer
• MIT Center for Civic Media open source CLAVIN Server for doing geo-parsing via HTTP
  o Includes special "civic sauce" for determining the "aboutness" of a document, narrowing down to the most likely place the document is talking about.
  o According to Civic Media, this is the best-quality geo-parsing service outside of Yahoo's paid service.
• Uses Apache OpenNLP for location entity extraction under the hood.

Setup

1. Download the source from https://github.com/sandsfish/CLAVIN-Server
2. Follow the instructions in the readme to build and set up the tool.
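Once the server is built and running, documents can be sent to it from any language over HTTP. The sketch below is only illustrative: the port and route are placeholders (check the CLAVIN-Server readme for the actual endpoint your build exposes), and it assumes the third-party requests library is installed.

import requests

# Read a sample document and post its text to the (hypothetical) parsing endpoint.
text = open("sample.txt", encoding="utf-8").read()
resp = requests.post("http://localhost:9090/parse",   # placeholder host, port, and route
                     data=text.encode("utf-8"),
                     headers={"Content-Type": "text/plain"})
resp.raise_for_status()
print(resp.text)  # inspect the response; its exact format depends on the server build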

Evaluating Assumptions

● We’re providing sample text to work with. What do you already know about it? What do you know from the data itself, and what information are you lacking?

● What characteristics of the sample data are likely contributing to the results you get from these tools? (Lack of pre-processing, for example)

● Note how long it takes for these tools to run (a small timing wrapper is sketched below). Consider the size of the data set we’re working with versus the size of possible data sets you may be interested in.
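A rough but serviceable way to note running time is to wrap the command-line invocation in a timer; the same pattern works for any of the tools above (the MALLET paths are the same placeholders as in the earlier sketches).

import subprocess, time

start = time.perf_counter()
subprocess.run(["mallet-2.0.8/bin/mallet", "train-topics",
                "--input", "corpus.mallet",
                "--num-topics", "10",
                "--output-topic-keys", "topic_keys.txt"], check=True)
print(f"Elapsed: {time.perf_counter() - start:.1f} seconds")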