NLP and Text Mining: an Introduction Matteo Romanello (DAI/KCL) Histore Workshop – IHR – June 21, 2012

NLP and Text Mining: an Introduction

Matteo Romanello (DAI/KCL)

Histore Workshop – IHR – June 21, 2012

Introduction

Basic Concepts

Section 1

Introduction

me

• BA Classics (Greek Literature and Philology)
• MA Digital Humanities (Univ. of Venice)
  • e-journals in Classics
• Currently:
  • PhD in Digital Humanities, King’s College London
    • information extraction from secondary sources
  • Research Associate at German Archaeological Institute (Berlin)
    • Digital Infrastructure for Research in the Arts and Humanities (DARIAH)

What and Why?

Digging into Data Challenge

http://criminalintent.org/

NLP Methods

Before the 1990s

• rely heavily on hand-coded rules
  • extract named entities with regexps
  • grammars, parsing, etc.
• top down
• hard to scale

From the 1990s onwards

• emphasis on statistical approaches
• machine learning
• bottom up
• scalable
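
The hand-coded, rule-based style can be illustrated with a small sketch. The rule below is a hypothetical one, not taken from the talk: it treats any run of two or more capitalised words as a candidate named entity, and the sample sentence is invented for the example.

```python
import re

# Hypothetical hand-coded rule: two or more consecutive capitalised words
# are treated as a candidate named entity.
CANDIDATE_ENTITY = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_candidates(text):
    """Return every substring matched by the hand-coded rule."""
    return CANDIDATE_ENTITY.findall(text)


sample = "Matteo Romanello spoke in London about Martha Ballard."
print(extract_candidates(sample))
# ['Matteo Romanello', 'Martha Ballard']
```

The single-word name "London" is missed, and catching it would take yet another rule. This is one way to see why such rule sets are hard to scale.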

NLP in DH

• increasing need for mediation of NLP knowledge
• adoption and appropriation of technology require
  • understanding of technology
  • familiarity with
    • jargon
    • to code or not to code?
    • basic concepts
  • understanding a field that is
    • evolving quickly
    • producing a growing body of literature
    • highly specialised

(some of the main) NLP Tasks

Speech Processing

• Machine Translation
• Speech Synthesis

Information Extraction

• Named Entity Extraction
• Named Entity [Classification | Resolution]
• Relationship Extraction
• Co-reference Resolution

Text Classification

• Sentiment Analysis
• Topic Modelling

My playlist of NLP frameworks

• Voyeur/Voyant tools [web-based]
  • reading, text analysis
  • text visualisation
• Natural Language Toolkit [Python]
• General Architecture for Text Engineering (Uni Sheffield) [Java]
• LingPipe [Java]
• OpenNLP (Apache foundation) [Java]

Challenges for NLP in DH

• tools do not always work straight out of the box
  • issues with
    • character encoding (despite Unicode)
    • output of OCR on historical documents
    • normalisation and pre-processing
• lack of ad-hoc resources
  • datasets for training, testing, evaluation
  • dictionaries and gazetteers
  • previous results for comparison
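
A minimal sketch of what normalisation and pre-processing can involve, using only Python's standard library. The specific rules (mapping the long s to "s", rejoining words hyphenated at a line break) are assumptions chosen for illustration; a real project would pick rules that suit its own documents.

```python
import unicodedata


def normalise(text):
    """Apply a few illustrative clean-up steps to OCR-like text."""
    # Compose characters so that 'e' + combining acute becomes a single 'é'.
    text = unicodedata.normalize("NFC", text)
    # Assumed project-specific rule: map the historical long s to a plain 's'.
    text = text.replace("\u017f", "s")
    # Assumed rule: rejoin words hyphenated across a line break.
    text = text.replace("-\n", "")
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(text.split())


raw = "Cafe\u0301  Congre\u017fs of hi-\nstory"
print(normalise(raw))
```

Blindly removing every end-of-line hyphen would also merge genuine compounds, so this step usually needs a dictionary check in practice.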

Section 2

Basic Concepts

Machine Learning

Supervised

• model is learned from training data

Models

• Hidden Markov Model
• Support Vector Machine
• Conditional Random Fields

Applications

• sequence labelling

Unsupervised

• data are fit into a model

Models

• Clustering
• Latent Dirichlet Allocation
• Latent Semantic Indexing

Applications

• document clustering
• topic modelling
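
To make "model is learned from training data" concrete, here is a deliberately simple sketch of supervised sequence labelling. It is not an HMM, SVM or CRF: it is a most-frequent-label baseline that ignores context entirely. The labels (PER, LOC, O) and the training sentences are invented for the example.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn, for each token, the label it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for token, label in sentence:
            counts[token][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def label_tokens(model, tokens, default="O"):
    """Label each token independently; unseen tokens get the default label."""
    return [(tok, model.get(tok, default)) for tok in tokens]


# Invented training data: each sentence is a list of (token, label) pairs.
training_data = [
    [("Matteo", "PER"), ("works", "O"), ("in", "O"), ("Berlin", "LOC")],
    [("Berlin", "LOC"), ("is", "O"), ("large", "O")],
]

model = train(training_data)
print(label_tokens(model, ["Matteo", "visited", "Berlin"]))
# [('Matteo', 'PER'), ('visited', 'O'), ('Berlin', 'LOC')]
```

Models such as HMMs and CRFs improve on this baseline by also taking neighbouring tokens and labels into account.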

Machine Learning Cycle (Sequence Labelling)

Evaluation

• TP, FP, TN, FN are defined in relation to a specific task
  • applicable to tasks where what we are looking for is known (quantifiable)
• Information Retrieval
  • retrieval of information relevant to a given search query
  • TP True Positives
    • docs we expected to show up and that showed up (relevant, present)
  • FP False Positives
    • docs we didn’t expect to show up but that showed up (not relevant, present)
  • TN True Negatives
    • docs we didn’t expect to show up and that did not show up (not relevant, missing)
  • FN False Negatives
    • docs we expected to show up but that did not show up (relevant, missing)
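
For the information retrieval case, the four counts can be derived with set operations. This is a sketch under the assumption that the whole collection, the set of relevant documents and the set of retrieved documents are all known; the document identifiers are invented.

```python
def confusion_counts(collection, relevant, retrieved):
    """Return (tp, fp, tn, fn) for one search query."""
    collection, relevant, retrieved = set(collection), set(relevant), set(retrieved)
    tp = len(relevant & retrieved)               # relevant, present
    fp = len(retrieved - relevant)               # not relevant, present
    fn = len(relevant - retrieved)               # relevant, missing
    tn = len(collection - relevant - retrieved)  # not relevant, missing
    return tp, fp, tn, fn


collection = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = ["d1", "d2", "d3"]   # what the query should return
retrieved = ["d2", "d3", "d4"]  # what the system actually returned

print(confusion_counts(collection, relevant, retrieved))
# (2, 1, 2, 1)
```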

Evaluation Metrics

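• precision

  $$\text{precision} = \frac{tp}{tp + fp}$$

• recall

  $$\text{recall} = \frac{tp}{tp + fn}$$

• accuracy

  $$\text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$

• f-score

  $$\text{f-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$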
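
The four metrics translate directly into code. In this sketch, returning 0.0 when a denominator is zero is an assumed convention, and the counts in the example are invented.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


# Invented counts: 8 TP, 2 FP, 85 TN, 5 FN.
tp, fp, tn, fn = 8, 2, 85, 5
print(f"precision = {precision(tp, fp):.3f}")         # 0.800
print(f"recall    = {recall(tp, fn):.3f}")            # 0.615
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.3f}")  # 0.930
print(f"f-score   = {f_score(tp, fp, fn):.3f}")       # 0.696
```

With these counts accuracy looks high mainly because true negatives dominate, which is one reason precision, recall and f-score are usually reported alongside it.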

Topic Modelling

M. Jockers, The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors

Key concepts

• the algorithm extracts topics and representative words
• the human interpreter eventually assigns a name/label to each topic
• the number of topics is decided a priori
• each doc has a different % of all the topics
• diachronic/synchronic exploration of topics

TM frameworks

• Mallet (Java)
• Gensim (Python)
• Stanford Topic Modelling Toolbox
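
The key concepts can be seen in a toy implementation. The sketch below is a small collapsed Gibbs sampler for LDA written with the standard library only. It is not how Mallet or Gensim are implemented, and in practice one of those frameworks would be used instead. The hyperparameters `alpha` and `beta`, the iteration count and the mini-corpus are all assumptions made for illustration.

```python
import random


def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns (doc_topic_shares, topic_word_counts)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]      # topic counts per document
    topic_word = [dict() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                    # total words per topic

    # Start from a random topic assignment for every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            t = rng.randrange(num_topics)
            doc_assign.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment before resampling it.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight of each topic given all the other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] = topic_word[t].get(w, 0) + 1
                topic_total[t] += 1

    # Each document's smoothed share of every topic (each row sums to 1).
    shares = [
        [(n + alpha) / (len(doc) + alpha * num_topics) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return shares, topic_word


# Invented mini-corpus: each document is a list of tokens.
docs = [
    "birth daughter delivered birth".split(),
    "snow cold rain snow".split(),
    "birth snow daughter cold".split(),
]
shares, topic_word = toy_lda(docs, num_topics=2)  # number of topics fixed a priori
for d, row in enumerate(shares):
    print(d, [round(x, 2) for x in row])
```

The topics come back as unnamed word-count tables, and it is left to the human interpreter to inspect the most frequent words in each and decide on a label.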

Topic Modelling (cont’d)

https://dhs.stanford.edu/algorithmic-literacy/my-definition-of-topic-modeling/

Martha Ballard’s Diary

http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/

Thematic Index of Classics in JSTOR

http://catalog.perseus.tufts.edu/jstor/

Mining the Dispatch

http://dsl.richmond.edu/dispatch/

Comprehending the Digital Humanities

https://dhs.stanford.edu/comprehending-the-digital-humanities/