Machine Learning in NLP
Joakim Nivre
Uppsala University
Linguistics and Philology
▶ Why do we use machine learning in NLP?
▶ When should we (not) use machine learning?
▶ Appropriate problems for machine learning (from Lecture 1):
  ▶ Problems for which there is no known exact method
  ▶ Problems for which the exact method is too expensive
  ▶ Problems that evolve over time
Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc². Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages. Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.
Alon Halevy, Peter Norvig and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems.
Plan for this Lecture
▶ A historical perspective
  ▶ Formal theory-driven systems
  ▶ Statistical methods
  ▶ Deep learning
▶ Strengths and weaknesses of machine learning in NLP
Running Example: Parsing
▶ Input: natural language sentence (word sequence)
▶ Output: tree or graph capturing syntactic structure
[Dependency tree for “we saw her duck”: nsubj(saw, we), obj(saw, her), xcomp(saw, duck)]
Computational Linguistics in the 1980s
▶ Languages described by formal systems
  ▶ Inventory of elementary units (lexicon)
  ▶ Rules for combining units (grammar)
▶ Created by linguists in a theoretical framework
  ▶ Linguistic levels: morphology, syntax, semantics
  ▶ Generate all and only well-formed expressions
▶ Combined with algorithms for analysis/synthesis
Issues
▶ Coverage
  ▶ Hard to build a complete description of a language
  ▶ Languages are constantly changing
▶ Robustness
  ▶ Language use is not always well-formed
  ▶ Made worse by lack of coverage
Issues
▶ Ambiguity
  ▶ Natural language grammars inherently ambiguous
  ▶ Combinatorial explosion from interacting rules and levels
  ▶ Practical applications need disambiguation
[Two dependency trees for “we saw her duck”: one with obj(saw, her) and xcomp(saw, duck), where “duck” is a verb, and one with obj(saw, duck) and nmod:poss(duck, her), where “duck” is a noun]
Statistical NLP in the 1990s
Eisner’s Model C
▶ Stochastic process generating a dependency tree
  ▶ Tree probability = product of subtree probabilities
  ▶ Subtree probability = product of child probabilities
  ▶ Child conditioned on tagged head word and preceding child tag (see the sketch below)
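To make the generative story concrete, here is a minimal Python sketch of the general idea (not Eisner’s exact parameterization, which distinguishes left and right children, among other details): each child, plus a final STOP event, is generated conditioned on the tagged head and the tag of the preceding child, and the tree probability is the product over all heads. The probability table is invented for illustration.

```python
from math import prod

# Hypothetical conditionals P(child_tag | tagged head, previous child's tag).
# "STOP" ends the child sequence; "NONE" means no previous sibling yet.
P = {
    ("saw/VBD", "NONE"): {"PRP": 0.6, "STOP": 0.1},
    ("saw/VBD", "PRP"):  {"NN": 0.3, "STOP": 0.4},
    ("saw/VBD", "NN"):   {"STOP": 0.8},
}

def children_prob(head, child_tags):
    """P(child sequence | head): each child conditioned on the head and
    the preceding child's tag, plus a final STOP event."""
    prev, p = "NONE", 1.0
    for tag in child_tags:
        p *= P[(head, prev)][tag]
        prev = tag
    return p * P[(head, prev)]["STOP"]

# Tree probability = product over heads of their child-sequence probabilities.
tree = {"saw/VBD": ["PRP", "NN"]}   # toy tree: "saw" governs a pronoun and a noun
print(prod(children_prob(h, cs) for h, cs in tree.items()))  # 0.6 * 0.3 * 0.8
```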
Statistical NLP in the 1990s
▶ Probabilistic models of language
  ▶ Generative models of P(X, Y)
  ▶ Examples: HMM, PCFG, NB
▶ Parameters estimated from (annotated) data
  ▶ Maximum-likelihood estimation
  ▶ Smoothing to cope with sparse data
▶ Inference algorithms for analysis
  ▶ Exact argmax search using dynamic programming
  ▶ Examples: Viterbi, CKY (see the sketch below)
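A minimal sketch of exact argmax decoding in this setting: Viterbi search over a toy HMM tagger. All probabilities are invented, and the 1e-6 floor on unseen emissions is a crude stand-in for proper smoothing.

```python
# Viterbi: exact argmax over tag sequences by dynamic programming.
def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][t] = probability of the best tag path for words[:i+1] ending in t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], 1):
        best.append({}); back.append({})
        for t in tags:
            prev = max(tags, key=lambda s: best[i - 1][s] * trans_p[s][t])
            best[i][t] = best[i - 1][prev] * trans_p[prev][t] * emit_p[t].get(w, 1e-6)
            back[i][t] = prev
    # Follow backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["PRP", "VBD"]
print(viterbi(["we", "saw"], tags,
              start_p={"PRP": 0.7, "VBD": 0.3},
              trans_p={"PRP": {"PRP": 0.1, "VBD": 0.9}, "VBD": {"PRP": 0.6, "VBD": 0.4}},
              emit_p={"PRP": {"we": 0.5}, "VBD": {"saw": 0.4}}))
# -> ['PRP', 'VBD']
```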
How Does This Help?
▶ Ambiguity
  ▶ Disambiguation through probability ranking
  ▶ Learning from data more effective than heuristics
  ▶ Statistical evaluation to measure progress
[The two dependency trees for “we saw her duck” again: probability ranking selects between the xcomp reading and the nmod:poss reading]
How Does This Help?
▶ Coverage
  ▶ Smoothing allows graceful degradation
  ▶ Unknown words can be interpreted in context
▶ Robustness
  ▶ Probability ranking allows constraint relaxation
  ▶ No sharp line between well-formed and deviant
A New Paradigm
▶ Emphasis on robust large-scale processing
▶ Quantitative evaluation
  ▶ Naturally occurring test data
  ▶ Exact numerical metrics (frequency-based)
▶ Data-driven development
  ▶ Naturally occurring training data
  ▶ Models induced using statistical inference
Machine Learning?
▶ Statistical models of the (early) 90s:
  ▶ Generative models of P(X, Y)
  ▶ Maximum likelihood estimation (with smoothing)
  ▶ No advanced learning algorithms – just counting (see the sketch below)
▶ Main limitation:
  ▶ Rigid independence assumptions (local context)
  ▶ Required for effective learning and efficient inference
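A minimal sketch of what “just counting” means in practice: transition probabilities for an HMM-style tagger estimated from bigram counts, with add-one smoothing so unseen transitions keep nonzero probability. The tiny tagged corpus is invented.

```python
from collections import Counter

# MLE by counting: estimate P(tag2 | tag1) from tag bigram counts.
corpus = [["PRP", "VBD", "PRP", "NN"], ["PRP", "VBD", "JJ", "NN"]]
tagset = {t for sent in corpus for t in sent}

bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
unigrams = Counter(t for s in corpus for t in s[:-1])

def p_trans(t1, t2):
    # add-one smoothing over the tagset
    return (bigrams[(t1, t2)] + 1) / (unigrams[t1] + len(tagset))

print(p_trans("PRP", "VBD"))  # seen twice: (2+1)/(3+4) ≈ 0.43
print(p_trans("NN", "VBD"))   # unseen:     (0+1)/(0+4) = 0.25
```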
Machine Learning in NLP (2005)
McDonald’s Discriminative Model
▶ Discriminative model of trees given sentences
  ▶ Online learning (perceptron style) – see the sketch below
  ▶ Max-margin objective (MIRA)
  ▶ Rich features over the input-output space
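A minimal structured-perceptron sketch in the spirit of such online learners (MIRA additionally computes a max-margin step size, which is omitted here). The `features` template and the `argmax_tree` inference routine are hypothetical stand-ins for rich feature extraction and (spanning-tree) search.

```python
from collections import Counter

def features(sentence, tree):
    # Toy template: one indicator feature per (head word, dependent word) arc.
    return Counter((sentence[h], sentence[d]) for h, d in tree)

def train_step(w, sentence, gold_tree, argmax_tree):
    pred = argmax_tree(sentence, w)      # best tree under current weights
    if pred != gold_tree:                # on error, move weights toward the
        for f, v in features(sentence, gold_tree).items():   # gold features...
            w[f] = w.get(f, 0) + v
        for f, v in features(sentence, pred).items():        # ...and away from
            w[f] = w.get(f, 0) - v                           # the prediction's
    return w
```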
Machine Learning in NLP (2005)
▶ Conditional or discriminative models
  ▶ Models for prediction X → Y
  ▶ Examples: Perceptron, SVM, MaxEnt
▶ Parameters estimated from (annotated) data
  ▶ Learning as numerical optimization
  ▶ Regularization to prevent overfitting
▶ Inference algorithms for analysis
  ▶ Exact argmax search not always possible
  ▶ Heuristic methods like beam search and reranking (see the sketch below)
How Does This Help?
▶ Independence assumptions can be relaxed
  ▶ No need to estimate joint distribution P(X, Y)
  ▶ Features over input X come for free
▶ Prediction accuracy improves with rich features
  ▶ Arbitrary combinations of input and output features
  ▶ Fall back on heuristic inference for efficiency if needed
Problem Solved?
▶ Feature engineering
  ▶ Feature combinations have to be hand-crafted
  ▶ Feature selection requires trial-and-error experiments
▶ Sparse discrete features
  ▶ Most features are binarized symbolic features (one-hot)
  ▶ Feature vectors get extremely high-dimensional but sparse
  ▶ Problematic for learning and efficient inference
Deep Learning in NLP (2014)
Chen and Manning’s Transition-Based Parser
[Figure from Chen and Manning, “A Fast and Accurate Dependency Parser using Neural Networks”: model architecture with a stack/buffer parser configuration for “He has good control .” feeding an input layer, a hidden layer, and a softmax output layer; correct transition: SHIFT]
▶ MaltParser with MLP instead of SVM (greedy, local)
▶ But 2 percentage points better LAS on PTB/CTB!?
Traditional Sparse Features
[Figure from Chen and Manning: traditional indicator features are binary and sparse (dim = 10^6–10^7), e.g.
  lc(s2).t = PRP ∧ s2.t = VBZ ∧ s1.t = JJ
  lc(s2).w = He ∧ lc(s2).l = nsubj ∧ s2.w = has
  s2.w = has ∧ s2.t = VBZ
  s1.w = good ∧ s1.t = JJ ∧ b1.w = control]
▶ Sparse – but lexical features and interaction features crucial (see the sketch below)
▶ Incomplete – unavoidable with hand-crafted feature templates
▶ Expensive – accounts for 95% of computing time
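A minimal sketch of how such binarized features are built: each instantiated template becomes one dimension of a huge sparse vector. The templates mirror the examples in the figure above; the indexing scheme is illustrative.

```python
feature_index = {}  # feature string -> dimension; 10^6-10^7 entries on real data

def index(feat):
    # assign each new instantiated template its own dimension
    if feat not in feature_index:
        feature_index[feat] = len(feature_index)
    return feature_index[feat]

def sparse_features(c):
    # c: atomic values read off the parser configuration (toy values below)
    feats = [
        f"s2.w={c['s2.w']}^s2.t={c['s2.t']}",
        f"s1.w={c['s1.w']}^s1.t={c['s1.t']}^b1.w={c['b1.w']}",
        f"lc(s2).t={c['lc(s2).t']}^s2.t={c['s2.t']}^s1.t={c['s1.t']}",
    ]
    return {index(f): 1 for f in feats}  # sparse one-hot vector

c = {"s2.w": "has", "s2.t": "VBZ", "s1.w": "good", "s1.t": "JJ",
     "b1.w": "control", "lc(s2).t": "PRP"}
print(sparse_features(c))  # {0: 1, 1: 1, 2: 1}
```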
Dense Features
[Figure from Chen and Manning: the solution is a neural network that learns a dense, compact feature representation (dim = 200). Each word is represented as a d-dimensional dense vector (a word embedding), and similar words are expected to have close vectors, e.g. “come”/“go” and “is”/“was”/“were” cluster together]
▶ Sparse – dense features capture similarities (words, pos, dep)
▶ Incomplete – neural network learns interaction features
▶ Expensive – matrix multiplication with low dimensionality (see the sketch below)
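A minimal numpy sketch of the dense alternative, with illustrative dimensions (d = 50 per item; words and tags share one toy embedding table here, whereas the actual parser uses separate word, POS, and label embeddings): look up one embedding per item in the configuration, concatenate, and let a hidden layer learn the feature interactions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 50, {"has": 0, "good": 1, "control": 2, "VBZ": 3, "JJ": 4}
E = rng.normal(size=(len(vocab), d))           # embedding matrix (learned)

items = ["has", "VBZ", "good", "JJ", "control"]
x = np.concatenate([E[vocab[w]] for w in items])   # dense input, 5*d dims

W, b = rng.normal(size=(200, x.size)), np.zeros(200)
h = np.maximum(0, W @ x + b)   # hidden layer (ReLU here; Chen and Manning
                               # actually use a cube activation)
print(h.shape)                 # (200,)
```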
PoS Embeddings
[Figure from Chen and Manning: t-SNE visualization of learned POS embeddings (van der Maaten and Hinton 2008)]
Dep Embeddings
[Figure from Chen and Manning: t-SNE visualization of learned dependency-label embeddings]
The Power of Embeddings
One-hot (discrete, sparse) vs. embedding (continuous, dense):
▶ Inherently much more expressive (D real values vs. a single 1)
▶ Can capture similarities between items (addresses sparsity)
▶ Can be pre-trained on large unlabeled corpora (addresses OOV)
▶ Can be learned/tuned specifically for the parsing task
Recurrent Neural Networks
▶ Bi-LSTM encodes global context in word representations
▶ Character models capture morphology (and help sparsity) – see the sketch below
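A minimal PyTorch sketch of these two ideas with toy sizes (not any specific parser): a character-level LSTM builds a subword-aware vector for each word, and a Bi-LSTM over the word sequence puts sentence-wide context into every word representation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_chars=100, n_words=1000, dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.char_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.word_emb = nn.Embedding(n_words, dim)
        self.bilstm = nn.LSTM(2 * dim, dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # char_ids: (n_words, max_word_len); the final hidden state of the
        # character LSTM summarizes each word's spelling
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        chars = h[-1]                                  # (n_words, dim)
        words = torch.cat([self.word_emb(word_ids), chars], dim=-1)
        out, _ = self.bilstm(words.unsqueeze(0))       # add a batch dimension
        return out.squeeze(0)                          # (n_words, 2*dim)

enc = Encoder()
print(enc(torch.tensor([1, 2, 3]), torch.zeros(3, 8, dtype=torch.long)).shape)
# torch.Size([3, 128]): each word now carries left and right sentence context
```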
Neural Network Techniques in Parsing
▶ Empirical results have improved substantially since 2014
▶ Neural network techniques yield more effective features:
  ▶ Features are learned (not hand-crafted)
  ▶ Features are continuous and dense (not discrete and sparse)
  ▶ Features can be tuned to (multiple) specific tasks
  ▶ Features can capture unbounded dependencies
  ▶ Features can capture subword regularities
▶ Parsing architectures remain essentially the same
Strengths and Weaknesses
▶ Is (deep) machine learning always the solution?
▶ On the one hand:
  ▶ Learning from data is extremely powerful
  ▶ Normally the first choice for maximizing accuracy
▶ On the other hand:
  ▶ Conditions for applying machine learning may not be ideal
  ▶ There may be additional factors to consider
The Unreasonable Effectiveness of Data?
▶ What kind of data is available?
  ▶ Do we have labeled data?
  ▶ How much data do we have?
  ▶ Do we have data from the right domain/language?
▶ What to do if we don’t have adequate/sufficient data?
  ▶ Collect and/or annotate (more) data
  ▶ Apply cross-domain or cross-language learning
  ▶ Consider a rule-based (or hybrid) method
F-score Isn’t All That Matters
▶ We may care more about minimum than average quality
▶ Users may want to have predictions explained
▶ There may be ethical considerations with biased data
▶ Companies often need to maintain legacy systems
Conclusion
▶ NLP today is overwhelmingly data-driven
▶ Deep learning is an evolution, not a revolution
▶ Machine learning is often the best solution
▶ But be open to pitfalls and alternative techniques