Machine Learning in NLP
Joakim Nivre
Uppsala University
Linguistics and Philology
▶ Why do we use machine learning in NLP?
▶ When should we (not) use machine learning?
▶ Appropriate problems for machine learning (from Lecture 1):
  ▶ Problems for which there is no known exact method
  ▶ Problems for which the exact method is too expensive
  ▶ Problems that evolve over time
Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc². Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages. Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.
Alon Halevy, Peter Norvig and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems.
Plan for this Lecture
▶ A historical perspective
  ▶ Formal theory-driven systems
  ▶ Statistical methods
  ▶ Deep learning
▶ Strengths and weaknesses of machine learning in NLP
Running Example: Parsing
▶ Input: natural language sentence (word sequence)
▶ Output: tree or graph capturing syntactic structure
[Dependency tree for “we saw her duck”: nsubj(saw, we), obj(saw, her), xcomp(saw, duck)]
Computational Linguistics in the 1980s
▶ Languages described by formal systems
  ▶ Inventory of elementary units (lexicon)
  ▶ Rules for combining units (grammar)
▶ Created by linguists in a theoretical framework
  ▶ Linguistic levels: morphology, syntax, semantics
  ▶ Generate all and only well-formed expressions
▶ Combined with algorithms for analysis/synthesis
Issues
▶ Coverage
  ▶ Hard to build a complete description of a language
  ▶ Languages are constantly changing
▶ Robustness
  ▶ Language use is not always well-formed
  ▶ Made worse by lack of coverage
Issues
▶ Ambiguity
  ▶ Natural language grammars inherently ambiguous
  ▶ Combinatorial explosion from interacting rules and levels
  ▶ Practical applications need disambiguation
[Two dependency trees for “we saw her duck”: one with obj(saw, her) and xcomp(saw, duck), where “duck” is a verb, and one with obj(saw, duck) and nmod:poss(duck, her), where “duck” is a noun]
Statistical NLP in the 1990s
Eisner’s Model C
▶ Stochastic process generating a dependency tree
  ▶ Tree probability = product of subtree probabilities
  ▶ Subtree probability = product of child probabilities
  ▶ Child conditioned on tagged head word and preceding child tag (see the sketch below)
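To make the generative story concrete, here is a minimal Python sketch of the general idea (not Eisner’s exact parameterization, which distinguishes left and right children, among other details): each child, plus a final STOP event, is generated conditioned on the tagged head and the tag of the preceding child, and the tree probability is the product over all heads. The probability table is invented for illustration.

```python
from math import prod

# Hypothetical conditionals P(child_tag | tagged head, previous child's tag).
# "STOP" ends the child sequence; "NONE" means no previous sibling yet.
P = {
    ("saw/VBD", "NONE"): {"PRP": 0.6, "STOP": 0.1},
    ("saw/VBD", "PRP"):  {"NN": 0.3, "STOP": 0.4},
    ("saw/VBD", "NN"):   {"STOP": 0.8},
}

def children_prob(head, child_tags):
    """P(child sequence | head): each child conditioned on the head and
    the preceding child's tag, plus a final STOP event."""
    prev, p = "NONE", 1.0
    for tag in child_tags:
        p *= P[(head, prev)][tag]
        prev = tag
    return p * P[(head, prev)]["STOP"]

# Tree probability = product over heads of their child-sequence probabilities.
tree = {"saw/VBD": ["PRP", "NN"]}   # toy tree: "saw" governs a pronoun and a noun
print(prod(children_prob(h, cs) for h, cs in tree.items()))  # 0.6 * 0.3 * 0.8
```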
Statistical NLP in the 1990s
▶ Probabilistic models of language
  ▶ Generative models of P(X, Y)
  ▶ Examples: HMM, PCFG, NB
▶ Parameters estimated from (annotated) data
  ▶ Maximum-likelihood estimation
  ▶ Smoothing to cope with sparse data
▶ Inference algorithms for analysis
  ▶ Exact argmax search using dynamic programming
  ▶ Examples: Viterbi, CKY (see the sketch below)
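A minimal sketch of exact argmax decoding in this setting: Viterbi search over a toy HMM tagger. All probabilities are invented, and the 1e-6 floor on unseen emissions is a crude stand-in for proper smoothing.

```python
# Viterbi: exact argmax over tag sequences by dynamic programming.
def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][t] = probability of the best tag path for words[:i+1] ending in t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], 1):
        best.append({}); back.append({})
        for t in tags:
            prev = max(tags, key=lambda s: best[i - 1][s] * trans_p[s][t])
            best[i][t] = best[i - 1][prev] * trans_p[prev][t] * emit_p[t].get(w, 1e-6)
            back[i][t] = prev
    # Follow backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["PRP", "VBD"]
print(viterbi(["we", "saw"], tags,
              start_p={"PRP": 0.7, "VBD": 0.3},
              trans_p={"PRP": {"PRP": 0.1, "VBD": 0.9}, "VBD": {"PRP": 0.6, "VBD": 0.4}},
              emit_p={"PRP": {"we": 0.5}, "VBD": {"saw": 0.4}}))
# -> ['PRP', 'VBD']
```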
How Does This Help?
▶ Ambiguity
  ▶ Disambiguation through probability ranking
  ▶ Learning from data more effective than heuristics
  ▶ Statistical evaluation to measure progress
[The two dependency trees for “we saw her duck” again: probability ranking selects between the xcomp reading and the nmod:poss reading]
How Does This Help?
▶ Coverage
  ▶ Smoothing allows graceful degradation
  ▶ Unknown words can be interpreted in context
▶ Robustness
  ▶ Probability ranking allows constraint relaxation
  ▶ No sharp line between well-formed and deviant
A New Paradigm
▶ Emphasis on robust large-scale processing
▶ Quantitative evaluation
  ▶ Naturally occurring test data
  ▶ Exact numerical metrics (frequency-based)
▶ Data-driven development
  ▶ Naturally occurring training data
  ▶ Models induced using statistical inference
Machine Learning?
▶ Statistical models of the (early) 90s:
  ▶ Generative models of P(X, Y)
  ▶ Maximum likelihood estimation (with smoothing)
  ▶ No advanced learning algorithms – just counting (see the sketch below)
▶ Main limitation:
  ▶ Rigid independence assumptions (local context)
  ▶ Required for effective learning and efficient inference
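A minimal sketch of what “just counting” means in practice: transition probabilities for an HMM-style tagger estimated from bigram counts, with add-one smoothing so unseen transitions keep nonzero probability. The tiny tagged corpus is invented.

```python
from collections import Counter

# MLE by counting: estimate P(tag2 | tag1) from tag bigram counts.
corpus = [["PRP", "VBD", "PRP", "NN"], ["PRP", "VBD", "JJ", "NN"]]
tagset = {t for sent in corpus for t in sent}

bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
unigrams = Counter(t for s in corpus for t in s[:-1])

def p_trans(t1, t2):
    # add-one smoothing over the tagset
    return (bigrams[(t1, t2)] + 1) / (unigrams[t1] + len(tagset))

print(p_trans("PRP", "VBD"))  # seen twice: (2+1)/(3+4) ≈ 0.43
print(p_trans("NN", "VBD"))   # unseen:     (0+1)/(0+4) = 0.25
```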
Machine Learning in NLP (2005)
McDonald’s Discriminative Model
▶ Discriminative model of trees given sentences
  ▶ Online learning (perceptron style) – see the sketch below
  ▶ Max-margin objective (MIRA)
  ▶ Rich features over the input-output space
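A minimal structured-perceptron sketch in the spirit of such online learners (MIRA additionally computes a max-margin step size, which is omitted here). The `features` template and the `argmax_tree` inference routine are hypothetical stand-ins for rich feature extraction and (spanning-tree) search.

```python
from collections import Counter

def features(sentence, tree):
    # Toy template: one indicator feature per (head word, dependent word) arc.
    return Counter((sentence[h], sentence[d]) for h, d in tree)

def train_step(w, sentence, gold_tree, argmax_tree):
    pred = argmax_tree(sentence, w)      # best tree under current weights
    if pred != gold_tree:                # on error, move weights toward the
        for f, v in features(sentence, gold_tree).items():   # gold features...
            w[f] = w.get(f, 0) + v
        for f, v in features(sentence, pred).items():        # ...and away from
            w[f] = w.get(f, 0) - v                           # the prediction's
    return w
```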
Machine Learning in NLP (2005)
▶ Conditional or discriminative models
  ▶ Models for prediction X → Y
  ▶ Examples: Perceptron, SVM, MaxEnt
▶ Parameters estimated from (annotated) data
  ▶ Learning as numerical optimization
  ▶ Regularization to prevent overfitting
▶ Inference algorithms for analysis
  ▶ Exact argmax search not always possible
  ▶ Heuristic methods like beam search and reranking (see the sketch below)
How Does This Help?
▶ Independence assumptions can be relaxed
  ▶ No need to estimate joint distribution P(X, Y)
  ▶ Features over input X come for free
▶ Prediction accuracy improves with rich features
  ▶ Arbitrary combinations of input and output features
  ▶ Fall back on heuristic inference for efficiency if needed
Problem Solved?
▶ Feature engineering
  ▶ Feature combinations have to be hand-crafted
  ▶ Feature selection requires trial-and-error experiments
▶ Sparse discrete features
  ▶ Most features are binarized symbolic features (one-hot)
  ▶ Feature vectors get extremely high-dimensional but sparse
  ▶ Problematic for learning and efficient inference
Deep Learning in NLP (2014)
Chen and Manning’s Transition-Based Parser
[Figure from Chen and Manning, “A Fast and Accurate Dependency Parser using Neural Networks”: model architecture with a stack/buffer parser configuration for “He has good control .” feeding an input layer, a hidden layer, and a softmax output layer; correct transition: SHIFT]
▶ MaltParser with MLP instead of SVM (greedy, local)
▶ But 2 percentage points better LAS on PTB/CTB!?
Traditional Sparse Features
[Figure from Chen and Manning: traditional indicator features are binary and sparse (dim = 10^6–10^7), e.g.
  lc(s2).t = PRP ∧ s2.t = VBZ ∧ s1.t = JJ
  lc(s2).w = He ∧ lc(s2).l = nsubj ∧ s2.w = has
  s2.w = has ∧ s2.t = VBZ
  s1.w = good ∧ s1.t = JJ ∧ b1.w = control]
▶ Sparse – but lexical features and interaction features crucial (see the sketch below)
▶ Incomplete – unavoidable with hand-crafted feature templates
▶ Expensive – accounts for 95% of computing time
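A minimal sketch of how such binarized features are built: each instantiated template becomes one dimension of a huge sparse vector. The templates mirror the examples in the figure above; the indexing scheme is illustrative.

```python
feature_index = {}  # feature string -> dimension; 10^6-10^7 entries on real data

def index(feat):
    # assign each new instantiated template its own dimension
    if feat not in feature_index:
        feature_index[feat] = len(feature_index)
    return feature_index[feat]

def sparse_features(c):
    # c: atomic values read off the parser configuration (toy values below)
    feats = [
        f"s2.w={c['s2.w']}^s2.t={c['s2.t']}",
        f"s1.w={c['s1.w']}^s1.t={c['s1.t']}^b1.w={c['b1.w']}",
        f"lc(s2).t={c['lc(s2).t']}^s2.t={c['s2.t']}^s1.t={c['s1.t']}",
    ]
    return {index(f): 1 for f in feats}  # sparse one-hot vector

c = {"s2.w": "has", "s2.t": "VBZ", "s1.w": "good", "s1.t": "JJ",
     "b1.w": "control", "lc(s2).t": "PRP"}
print(sparse_features(c))  # {0: 1, 1: 1, 2: 1}
```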
Dense Features
[Figure from Chen and Manning: the solution is a neural network that learns a dense, compact feature representation (dim = 200). Each word is represented as a d-dimensional dense vector (a word embedding), and similar words are expected to have close vectors, e.g. “come”/“go” and “is”/“was”/“were” cluster together]
▶ Sparse – dense features capture similarities (words, pos, dep)
▶ Incomplete – neural network learns interaction features
▶ Expensive – matrix multiplication with low dimensionality (see the sketch below)
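A minimal numpy sketch of the dense alternative, with illustrative dimensions (d = 50 per item; words and tags share one toy embedding table here, whereas the actual parser uses separate word, POS, and label embeddings): look up one embedding per item in the configuration, concatenate, and let a hidden layer learn the feature interactions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 50, {"has": 0, "good": 1, "control": 2, "VBZ": 3, "JJ": 4}
E = rng.normal(size=(len(vocab), d))           # embedding matrix (learned)

items = ["has", "VBZ", "good", "JJ", "control"]
x = np.concatenate([E[vocab[w]] for w in items])   # dense input, 5*d dims

W, b = rng.normal(size=(200, x.size)), np.zeros(200)
h = np.maximum(0, W @ x + b)   # hidden layer (ReLU here; Chen and Manning
                               # actually use a cube activation)
print(h.shape)                 # (200,)
```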
PoS Embeddings
[Figure from Chen and Manning: t-SNE visualization of learned POS embeddings (van der Maaten and Hinton 2008)]
Dep Embeddings
[Figure from Chen and Manning: t-SNE visualization of learned dependency-label embeddings]
The Power of Embeddings
One-hot (discrete, sparse) vs. embedding (continuous, dense):
▶ Inherently much more expressive (D real values vs. a single 1)
▶ Can capture similarities between items (addresses sparsity)
▶ Can be pre-trained on large unlabeled corpora (addresses OOV)
▶ Can be learned/tuned specifically for the parsing task
Recurrent Neural Networks
▶ Bi-LSTM encodes global context in word representations
▶ Character models capture morphology (and help sparsity) – see the sketch below
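A minimal PyTorch sketch of these two ideas with toy sizes (not any specific parser): a character-level LSTM builds a subword-aware vector for each word, and a Bi-LSTM over the word sequence puts sentence-wide context into every word representation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_chars=100, n_words=1000, dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.char_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.word_emb = nn.Embedding(n_words, dim)
        self.bilstm = nn.LSTM(2 * dim, dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # char_ids: (n_words, max_word_len); the final hidden state of the
        # character LSTM summarizes each word's spelling
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        chars = h[-1]                                  # (n_words, dim)
        words = torch.cat([self.word_emb(word_ids), chars], dim=-1)
        out, _ = self.bilstm(words.unsqueeze(0))       # add a batch dimension
        return out.squeeze(0)                          # (n_words, 2*dim)

enc = Encoder()
print(enc(torch.tensor([1, 2, 3]), torch.zeros(3, 8, dtype=torch.long)).shape)
# torch.Size([3, 128]): each word now carries left and right sentence context
```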
Neural Network Techniques in Parsing
▶ Empirical results have improved substantially since 2014
▶ Neural network techniques yield more effective features:
  ▶ Features are learned (not hand-crafted)
  ▶ Features are continuous and dense (not discrete and sparse)
  ▶ Features can be tuned to (multiple) specific tasks
  ▶ Features can capture unbounded dependencies
  ▶ Features can capture subword regularities
▶ Parsing architectures remain essentially the same
Strengths and Weaknesses
▶ Is (deep) machine learning always the solution?
▶ On the one hand:
  ▶ Learning from data is extremely powerful
  ▶ Normally the first choice for maximizing accuracy
▶ On the other hand:
  ▶ Conditions for applying machine learning may not be ideal
  ▶ There may be additional factors to consider
The Unreasonable Effectiveness of Data?
▶ What kind of data is available?
  ▶ Do we have labeled data?
  ▶ How much data do we have?
  ▶ Do we have data from the right domain/language?
▶ What to do if we don’t have adequate/sufficient data?
  ▶ Collect and/or annotate (more) data
  ▶ Apply cross-domain or cross-language learning
  ▶ Consider a rule-based (or hybrid) method
F-score Isn’t All That Matters
▶ We may care more about minimum than average quality
▶ Users may want to have predictions explained
▶ There may be ethical considerations with biased data
▶ Companies often need to maintain legacy systems
Conclusion
▶ NLP today is overwhelmingly data-driven
▶ Deep learning is an evolution, not a revolution
▶ Machine learning is often the best solution
▶ But be open to pitfalls and alternative techniques