Upload
olga-scrivner
View
106
Download
0
Embed Size (px)
Citation preview
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Interactive Visual Data AnalysisPart Two
Interactive Text Mining Suite
Olga Scrivner
Indiana University
Workshop in Methods
1 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Outline
1 Introduce a web application for text processing and mining
2 Learn about natural language processing techniques
3 Develop practical skills
2 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Data Mining
“As our collective knowledge continues to be digitized andstored (...) it becomes more difficult to find and discover what
we are looking for.” (Blei 2012)
3 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
New Ways of Exploring Data Collections
Word clouds (Vuillemot et al., 2009)
4 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Visualization Methods
Social network graphs (Rydberg-Cox, 2011)
5 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Visualization Methods
Tracking emotion and sentiment in fairy tales(Mohammad, 2012)
6 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Topic Modeling
Discovering underlying theme of collection from Science magazine1990-2000 (Blei 2012)
7 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Technological and Methodological Obstacles
Many tools require some programming skills (Mallet,Meta, R and Python libraries)
GUI tools are limited to certain formats and functions(Voyant, PaperMachine)
Lack of active control by users
8 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Interactive Text Mining Suite
A user-friendly tool for quantitative analysis andvisualization of unstructured data
Platform-independent
Interactive
9 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
ITMS Structure
1 File Uploads
Upload files (txt, pdf, rdf and Google books API)
2 Data Preparation
Data preprocessing (stopwords, stemming, metadata)
3 Data Visualization
Word frequencies, Cluster analysis and topic modeling
10 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
ITMS Structure
1 File Uploads
Upload files (txt, pdf, rdf and Google books API)
2 Data Preparation
Data preprocessing (stopwords, stemming, metadata)
3 Data Visualization
Word frequencies, Cluster analysis and topic modeling
10 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Workshop Files
Download 3 text files
http://ssrc.indiana.edu/seminars/wim.shtml
NY Times articles (3 documents in a plain text format)
ITMS Web site:
http://www.interactivetextminingsuite.com
11 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Upload File
12 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Upload File
12 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Upload File
12 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Preprocessing Data
Before performing data analysis we should preprocess data.
13 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Preprocessing Options
Select preprocessing options and click apply.
14 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Stopwords
Stopwords (e.g. the, and): select Default for English
15 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Manual Removal of Stopwords
Based on the need, remove any additional stopwords that youmay consider a noise, e,g, paper, shows etc
Select apply
16 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Stemming
To improve analytics, you can stem all your tokens, ex. insteadof worked, works, working, you will have only one relevantstem work
17 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Metadata Extraction
You can extract or upload metadata. You will need datestamp(year) information for chronological topic modeling.
18 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Visualization
19 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Word Cloud Representation
20 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Customization
21 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Cluster Analysis
You need to have at least three documents
Documents will be grouped based on their term similaritymeasures
22 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Cluster Analysis
23 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Topic Modeling
LDA (Latent Dirichlet allocation)
STM (Structural Topic model)
Chronological topic visualization (lda): requires metadata
24 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Topic Modeling Tuning
Selection of topics (how many different themes)
Selection of words per theme (how many words per topic)
Selection of iteration
25 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Topic Model Selection
26 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
LDA Topic Model
27 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
STM Topic Model
28 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Other Formats - Google Books
Before switching to other data formats, refresh your localbrowser.
Start with File Uploads and select Structured Data
29 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Other Formats - Google Books
Select your search terms and submit
Current limitation is 40 books
30 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Future Options
Shiny Web Application is highly customizable
1 Part-of-speech tagging (tm package)
2 Network analysis (igraph package)
3 Name Entity Recognition (NLP package)
4 Twitter Streaming (twitterR package) - will requires user’stwitter set-up for streaming but information will beprovided how to set it up
Open for other suggestions and collaboration - [email protected]
31 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
Acknowledgements
I would like to thank WIM for providing this opportunity.
Contributors: Jefferson Davis, Irina Trapido, Jay Lee
32 / 33
Introduction
ITMS
PreprocessingData
DataVisualization
ClusterAnalysis
TopicModeling
Google BookAPI
FutureDirections
References
References I
[1] Many open source R packages: tm, shiny, NLP, stringi, stringr, topicmodels, lda and many more
[2] Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge:Cambridge University Press
[3] Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber RandiReppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: CambridgeUniversity Press
[4] Jockers, Matthew. 2014. Text Analysis with R for Students of Literature. Quantitative Methods in theHumanities and Social Sciences. Springer International Publishing, Cham
[5] Moretti, Franco. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso
[6] Oelke, Daniella, Dimitrios Kokkinakis, and Mats Malm. 2012. Advanced visual analytics methods forliterature analysis. Proceedings of the 6th EACL Workshop on Language Technology for CulturalHeritage, Social 561Sciences, and Humanities, pages 35–44image credits: https://media.giphy.com/media/10zsjaH4g0GgmY/giphy.gif
33 / 33