Introduction
Language Variation Suite
Visual Analytics for Digital Humanities
Interactive Text Mining Suite
Conclusion
References
Data Visualization: Language Variation Suite and Interactive Text Mining Suite
Olga Scrivner
Indiana University
LSU, April 2016
Data Analysis and Visualization
“As our collective knowledge continues to be digitized and stored (...) it becomes more difficult to find and discover what we are looking for.” (Blei 2012)
“Mastery of quantitative methods is increasingly becoming a vital component of linguistic training” (Johnson, 2008)
Data Types
1 Structured Data
2 Unstructured Data
Quantitative Analysis for Structured Data
Traditional Tools:
a. Categorical variable
b. Independence of observations
c. Normally distributed data
d. Large corpus size
Linguistic Data:
a. Categorical, continuous, multivariate, ordinal
b. Correlated data
c. Unbalanced data
d. Small corpus size
Data Visualization
Word Order in Latin (Passarotti et al., 2013)
Visual Analytics - “The science of analytical reasoning facilitated by visual interactive interfaces” (Thomas et al., 2005)
New Tools of Linguistic Analysis (Baayen 2008, Tagliamonte 2014, Gries 2015)
1 Mixed Model:
A statistical regression model containing fixed effects (independent variables) and random effects (e.g., individual- or word-specific effects).
Measures variability between subjects and correlation of observations within subjects
Can handle unbalanced data
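The between-subject variability and within-subject correlation mentioned above can be illustrated with a minimal sketch. It does not fit a mixed model. It assumes balanced data (the same number of observations per speaker) and estimates the intraclass correlation from one-way ANOVA variance components. The speaker names and scores are invented for illustration.

```python
from statistics import mean


def intraclass_correlation(groups):
    """Estimate ICC(1) from one-way ANOVA mean squares (balanced groups only)."""
    sizes = {len(values) for values in groups.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes equal observations per speaker")
    k = sizes.pop()   # observations per speaker
    n = len(groups)   # number of speakers
    grand = mean(x for values in groups.values() for x in values)
    # Variability of speaker means around the grand mean.
    ms_between = k * sum((mean(v) - grand) ** 2 for v in groups.values()) / (n - 1)
    # Variability of observations around their own speaker's mean.
    ms_within = sum(
        (x - mean(v)) ** 2 for v in groups.values() for x in v
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical scores: three observations for each of three speakers.
scores = {
    "speaker_a": [0.90, 0.85, 0.95],
    "speaker_b": [0.40, 0.50, 0.45],
    "speaker_c": [0.70, 0.65, 0.75],
}
icc = intraclass_correlation(scores)
```

A value close to 1 means that observations from the same speaker resemble each other far more than observations from different speakers, which is the situation where a random effect for speaker matters most.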
New Tools of Linguistic Analysis (Baayen 2008, Tagliamonte 2014, Gries 2015)
2 Conditional inference trees and Random Forests
Uses predictive modeling
“Proves to be more stable than stepwise variable selection approaches available for logistic regression” (Strobl 2009:325)
Can handle skewed data that often violate the assumptions of regression approaches
RStudio and Shiny Application
1 R - a free programming language for statistical computing and graphics
2 RStudio - Integrated Development Environment: a source code editor, an executor and a debugger
3 Shiny App - a web application framework for R
Computational power of R + Web interactivity
Language Variation Suite (LVS) - a Statistical Shiny Application
1 From https://languagevariationsuite.wordpress.com/ download Labov’s New York 1966 data (LabovData.csv) and Bentivoglio & Sedano’s 1993 Caracas data (CaracasData.csv)
2 Open the LVS application: https://languagevariationsuite.shinyapps.io/Pages
Language Variation Suite - Introduction
1 Data in csv format (no spaces in column names)
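One way to prepare a file for this requirement is sketched below. It assumes a csv whose header contains spaces and replaces each run of whitespace with an underscore. The column names and the single data row are hypothetical and are not taken from LabovData.csv.

```python
import csv
import io
import re


def clean_column_names(names):
    """Trim each name and replace runs of whitespace with an underscore."""
    return [re.sub(r"\s+", "_", name.strip()) for name in names]


# Hypothetical csv content with spaces in the header row.
raw = "Store,Lexical Item,Speech Style,R Use\nSaks,Floor,Normal,retention\n"

reader = csv.reader(io.StringIO(raw))
header = clean_column_names(next(reader))  # first row holds the column names
rows = list(reader)                        # remaining rows are the data
```

After this step `header` holds names without spaces, and the rows can be written back out with `csv.writer` before uploading.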
Files
2 Upload your file
Descriptive Data Analysis - Table
Table displays your dataset and lets you filter it by a search word or sort columns in ascending or descending order.
Summary
Summary provides a quantitative summary for each variable, e.g. frequency count, mean, median.
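The kind of per-variable summary described here can be sketched as follows. This is an illustration only, not the code behind LVS: it reports frequency counts for text columns and the mean and median for numeric ones. The column names and values are invented.

```python
from collections import Counter
from statistics import mean, median


def summarize(column):
    """Return mean and median for numeric columns, frequency counts otherwise."""
    if all(isinstance(value, (int, float)) for value in column):
        return {"mean": mean(column), "median": median(column)}
    return dict(Counter(column))


# Hypothetical dataset with one categorical and one numeric variable.
data = {
    "store": ["Saks", "Macys", "Klein", "Saks"],
    "duration": [0.95, 1.53, 1.10, 1.22],
}
summary = {name: summarize(values) for name, values in data.items()}
```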
Data Structure
1 factor - categorical values, e.g. m/f (gender), 20-34/65+ (age), low/high (economic level)
2 num - numerical values, e.g. 0.95, 1.53
3 int - integer values, e.g. 1, 2, 10
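The three labels can be mimicked with a short sketch that inspects the raw string values of each column. It loosely mirrors the factor / num / int distinction listed above and is an assumption about how one might classify columns, not a description of how R or LVS does it internally. The columns are hypothetical.

```python
def infer_type(values):
    """Label a column as 'int', 'num', or 'factor' from its string values."""

    def parses(convert):
        # True when every value can be converted without error.
        try:
            for value in values:
                convert(value)
            return True
        except ValueError:
            return False

    if parses(int):
        return "int"
    if parses(float):
        return "num"
    return "factor"


# Hypothetical columns as they would be read from a csv file.
columns = {
    "gender": ["m", "f", "f"],
    "age": ["20-34", "65+", "20-34"],
    "duration": ["0.95", "1.53", "1.10"],
    "tokens": ["1", "2", "10"],
}
types = {name: infer_type(values) for name, values in columns.items()}
```

Note that an age range such as 20-34 is not a number, so the column is treated as categorical.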
Data Subset
Cross-Tabulation
Cross-tabulation is a useful feature to examine the distribution of your dependent variable.
Cross-Tabulation
Saks (upper middle-class store), Macy’s (middle-class store), Klein (working-class)
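A cross-tabulation of a dependent variable against one factor can be sketched in a few lines. The token list below is invented for illustration and does not reproduce Labov’s counts. The variable names `store` and `r` are assumptions.

```python
from collections import Counter


def cross_tabulate(rows, row_var, col_var):
    """Count how often each pair of levels of two variables occurs together."""
    counts = Counter((row[row_var], row[col_var]) for row in rows)
    row_levels = sorted({row[row_var] for row in rows})
    col_levels = sorted({row[col_var] for row in rows})
    return {
        r: {c: counts[(r, c)] for c in col_levels}
        for r in row_levels
    }


# Hypothetical tokens: one dictionary per observation.
tokens = [
    {"store": "Saks", "r": "retention"},
    {"store": "Saks", "r": "retention"},
    {"store": "Saks", "r": "deletion"},
    {"store": "Klein", "r": "deletion"},
    {"store": "Klein", "r": "deletion"},
    {"store": "Klein", "r": "retention"},
]
table = cross_tabulate(tokens, "store", "r")
```

Each row of `table` shows how the levels of the dependent variable are distributed within one store.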
Cluster
Cluster Analysis allows you to classify your data into sub-groups (clusters), which are defined by your data. Items in the same cluster will be very similar to one another.
Saks (upper middle-class store), Macy’s (middle-class store), Klein (working-class)
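The idea that clusters are defined by the data can be shown with a small agglomerative sketch. It uses single linkage (distance between the two closest members) and stops at a requested number of clusters. The clustering method used by LVS is not stated in the slides, so this is only one possible technique, and the speakers and feature values are hypothetical.

```python
from itertools import combinations
from math import dist


def agglomerate(points, k):
    """Merge the two closest clusters (single linkage) until k clusters remain."""
    clusters = [[name] for name in points]
    while len(clusters) > k:
        # Pick the pair of clusters whose closest members are nearest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                dist(points[p], points[q])
                for p in clusters[pair[0]]
                for q in clusters[pair[1]]
            ),
        )
        clusters[a].extend(clusters[b])
        del clusters[b]  # b > a, so index a stays valid
    return [sorted(cluster) for cluster in clusters]


# Hypothetical speakers described by two numeric features each.
speakers = {
    "s1": (0.90, 0.80),
    "s2": (0.85, 0.75),
    "s3": (0.20, 0.10),
    "s4": (0.25, 0.15),
}
groups = agglomerate(speakers, k=2)
```

Speakers with similar feature values end up in the same group without any group labels being supplied in advance.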
LVS - Inferential Analysis
Fixed Regression Model - ignoring individual variations (speakers or words) may lead to Type I Error: “a chance effect is mistaken for a real difference between the populations”
Mixed Regression Model - prone to Type II Error: “if speaker variation is at a high level, we cannot discern small population effects without a large number of speakers” (Johnson 2009, 2015)
Regression Model Selection
Model Output
Interpretation
Dependent Variable: deletion and retention
By default, deletion is the reference value (the first value alphabetically)
Results are interpreted for retention
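The alphabetical default can be made concrete with a short sketch. It assumes a binary dependent variable, sorts its two values, treats the first as the reference, and codes the other as 1, so that model results describe the second value. The function name and the observations are hypothetical.

```python
def code_against_reference(values):
    """Code a binary variable as 0/1 with the alphabetically first level as reference."""
    levels = sorted(set(values))
    if len(levels) != 2:
        raise ValueError("this sketch handles exactly two levels")
    reference, modeled = levels
    coded = [int(value == modeled) for value in values]
    return reference, modeled, coded


# Hypothetical observations of the dependent variable.
observations = ["retention", "deletion", "retention", "deletion", "deletion"]
reference, modeled, coded = code_against_reference(observations)
```

Here `reference` is "deletion" and `modeled` is "retention", so a positive coefficient would mean a factor favors retention.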
Interpretation
Lexical item Fourth has a negative effect on retention and is significant
Normal style has a slightly negative effect on retention but its coefficient is not significant
Macy’s and Saks have a positive and significant effect on retention. Saks (upper middle class store) is more significant than Macy’s (middle class store)
Conditional Tree
Conditional Tree
Store is the most significant factor for R-use: Klein (working class store) shows more R-deletion; Macy’s and Saks have a higher rate of R-retention, which also depends on the lexical item (Floor shows more retention than Fourth)
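How a tree decides which factor to split on first can be sketched with a permutation test. Conditional inference trees select the predictor most strongly associated with the response. The version below is a simplification under stated assumptions: it uses a chi-square statistic, a Monte Carlo permutation p-value with a fixed seed, and no correction for multiple testing. The data are invented so that store predicts the outcome and lexical item does not.

```python
import random
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical variables."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    total = 0.0
    for x in x_counts:
        for y in y_counts:
            expected = x_counts[x] * y_counts[y] / n
            total += (joint[(x, y)] - expected) ** 2 / expected
    return total


def permutation_p_value(xs, ys, n_perm=999, seed=0):
    """Share of shuffled responses whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = chi_square(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi_square(xs, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical tokens: 6 from Klein (all deletion), 6 from Saks (all retention).
store = ["Klein"] * 6 + ["Saks"] * 6
word = ["Floor", "Fourth"] * 6
r_use = ["deletion"] * 6 + ["retention"] * 6

p_values = {
    "store": permutation_p_value(store, r_use),
    "word": permutation_p_value(word, r_use),
}
first_split = min(p_values, key=p_values.get)
```

In this toy data the first split falls on store, because shuffling the responses almost never reproduces an association as strong as the observed one.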
Random Forest
Random Forest
The variable importance score demonstrates that Store is the most important predictor, followed by Lexical Item. A variable is irrelevant if its importance is around zero and the cut-off value (red dotted line).
Data with Token Frequency
Upload CaracasData.csv from https://languagevariationsuite.wordpress.com/
Tokens
Let’s Have a Short Break
Visual Analytics for Digital Humanities
The “epic transformation of archives” - shifting from print to digital archival form (Folsom, 2007)
Digital Humanities Manifesto 2.0 (2009) and Berry (2011)
1st Wave: “The first wave of digital humanities work was quantitative, mobilizing the search and retrieval powers of the database, automating corpus linguistics, stacking hypercards into critical arrays”
2nd Wave: “The second wave is qualitative, interpretive”, concentrating on new tools for creating and curating digital repositories (Berry, 2011)
3rd Wave: Concentration on the computationality, search, retrieval and analysis originated in humanities-based work
New Ways of Exploring Data Collections
Graphs, maps and trees for literature analysis (Moretti, 2005)
Visual Analytics
Word clouds to analyze a novel (Vuillemot et al., 2009)
Visual Analytics
Social network graphs of characters in Greek tragedies (Rydberg-Cox, 2011)
Visual Analytics
Literary fingerprint and summaries (Oelke et al., 2012)
Visual Analytics
Tracking emotion and sentiment in fairy tales (Mohammad, 2012)
Topic Modeling
Discovering the underlying themes of a collection from Science magazine, 1990-2000 (Blei, 2012)
For more information on topic modeling: http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/
Interactive Text Mining Suite - Introduction
1 Download 3 text files (dante01.txt, dante02.txt, dante03.txt) from https://languagevariationsuite.wordpress.com/ (workshop)
2 ITMS Application: https://languagevariationsuite.shinyapps.io/TextMining/
Upload Files - txt
Explore
Metadata
Fields: ID, Date, Title, Author, Other
Extract from pdf files
Upload from csv file
Stopwords
Frequency
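The stopword and frequency steps can be sketched together: lowercase the text, split it into word tokens, drop the stopwords, and count what remains. The sample sentence and the stopword list are stand-ins chosen for illustration. They are not the contents of the dante files or the list used by ITMS.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords):
    """Count word tokens in text, ignoring case and skipping stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return Counter(token for token in tokens if token not in stopwords)


# Hypothetical sample text and stopword list.
sample = "The forest was dark and the path was lost in the dark forest"
stopwords = {"the", "was", "and", "in"}

frequencies = word_frequencies(sample, stopwords)
top_words = frequencies.most_common(2)
```

Adding more words to `stopwords` and recounting corresponds to refining the stopword list after a first look at the frequencies.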
Frequency Visualization
More Stopwords
Topic Modeling
Selection of topics (how many different themes)
Selection of words per theme (how many words per topic)
Identification of the best topic number
Models
LDA (Latent Dirichlet Allocation)
STM (Structural Topic Model)
Chronological topic visualization (LDA): requires metadata
Cluster Analysis
Punctuation Analysis
Future Directions
1 New LVS features:
(a) Traditional Rbrul analysis (for comparison)
(b) Variable re-coding and dataset modification
2 New ITMS features:
(a) Network graphs
(b) Dynamic graphs
Acknowledgements
I would like to thank Professor Rafael Orozco, Professor Irina Shport and LSU Linguistics for inviting me and organizing this workshop.
References I
[1] Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press
[2] Bentivoglio, Paola and Mercedes Sedano. 1993. Investigación sociolingüística: sus métodos aplicados a una experiencia venezolana. Boletín de Lingüística 8. 3-35
[3] Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber & Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press
[4] Jockers, Matthew. 2014. Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer International Publishing, Cham
[5] Labov, W. 1966. The Social Stratification of English in New York City. Washington: Center for Applied Linguistics
[6] Moretti, Franco. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso
[7] Oelke, Daniella, Dimitrios Kokkinakis, and Mats Malm. 2012. Advanced visual analytics methods for literature analysis. Proceedings of the 6th EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 35-44
[8] Passarotti, Marco, Barbara McGillivray, and David Bamman. 2013. A Treebank-based Study on Latin Word Order. In Proceedings of the 16th International Colloquium on Latin Linguistics, Uppsala, Sweden, 340–352
[9] Schnapp, Jeffrey, and Peter Presner. 2009. Digital Humanities Manifesto 2.0.
[10] http://blog.kandu.com/post/57065268403/book-reading-gif
[11] http://cdn.business2community.com/wp-content/uploads/2014/09/archives01.jpg