
Page 1: A Panorama of Natural Language Processing

A Panorama of Natural Language Processing

Ted Xiao

Page 2: A Panorama of Natural Language Processing

Overview
• Background

• Grammars

• Word Representation

• Modern NLP

• Future Directions

• NLP in Industry

• Demos

Page 3: A Panorama of Natural Language Processing

What is Natural Language Processing?

NLP!

Artificial Languages: Java, C++, Binary…

Page 4: A Panorama of Natural Language Processing

What is Natural Language Processing?

NLP!

Artificial Languages: Java, C++, Binary…

Natural Language: Language spoken by people.

Page 5: A Panorama of Natural Language Processing

What is Natural Language Processing?

NLP!

Artificial Languages: Java, C++, Binary…

Natural Language: Language spoken by people.

Motivation: sophisticated linguistic analysis that approaches human-like understanding across a range of tasks and applications.

Page 6: A Panorama of Natural Language Processing

What is Natural Language Processing?

NLP!

Goal: have computers understand natural language in order to perform useful tasks

Artificial Languages: Java, C++, Binary…

Natural Language: Language spoken by people.

Motivation: sophisticated linguistic analysis that approaches human-like understanding across a range of tasks and applications.

Page 7: A Panorama of Natural Language Processing

Task Types
● Syntax
  ○ Parsing
  ○ Stemming
  ○ Part of speech tagging
● Discourse
  ○ Parsing
  ○ Stemming
  ○ Part of speech tagging

Page 8: A Panorama of Natural Language Processing

Task Types
● Syntax
  ○ Parsing
  ○ Stemming
  ○ Part of speech tagging
● Semantics
  ○ Machine Translation
  ○ Natural Language Understanding, Generation
  ○ OCR
  ○ QA, Sentiment Analysis
  ○ Coreference
● Discourse
  ○ Parsing
  ○ Stemming
  ○ Part of speech tagging
● Speech
  ○ Speech Recognition
  ○ Text-to-Speech
  ○ Speech-to-Text

Page 9: A Panorama of Natural Language Processing

Task Examples

Page 10: A Panorama of Natural Language Processing

Task Examples

Page 11: A Panorama of Natural Language Processing

NLP in Industry

Page 12: A Panorama of Natural Language Processing

What Makes NLP Difficult?
• We don’t understand language ourselves

• Language encodes meaning

• Language is learned intuitively - easy for children, hard for computers


Page 13: A Panorama of Natural Language Processing

What Makes NLP Difficult?
• We don’t understand language ourselves

• Language encodes meaning

• Language is learned intuitively - easy for children, hard for computers

• Ambiguity

• Language is symbolic

• Subtleties: sarcasm, wordplay, idioms...


Page 14: A Panorama of Natural Language Processing

NLP vs. PLP
Programming Language Processing is easier than Natural Language Processing.

Page 15: A Panorama of Natural Language Processing

Examples of ambiguity: news headlines
● The Pope’s baby steps on gays
● Scientists study whales from space
● Juvenile court to try shooting defendant
● Boy paralyzed after tumor fights back to gain black belt

Page 16: A Panorama of Natural Language Processing

An NLP Disaster

Page 17: A Panorama of Natural Language Processing

An NLP Disaster: Microsoft Tay (March 2016)

Page 18: A Panorama of Natural Language Processing

...again?! Microsoft Zo (December 2016)

Page 19: A Panorama of Natural Language Processing

Overview
• Background

• Linguistics

• Word Representation

• Modern NLP

• Future Directions

• NLP in Industry

• Demos

Page 20: A Panorama of Natural Language Processing

Grammars
● Grammars are the formal description of the structure of a language
● The skeleton of any language

Page 21: A Panorama of Natural Language Processing

Basic Linguistics (diagram relating Context, Form, Meaning, Structure, and Audio)

Page 22: A Panorama of Natural Language Processing

Levels of NLP

Page 23: A Panorama of Natural Language Processing

Overview
• Background

• Grammars

• Word Representation

• Modern NLP

• Future Directions

• NLP in Industry

• Demos

Page 24: A Panorama of Natural Language Processing

Digitizing Natural Language
• Need to have some measure of similarity and differences between words
• Vectors can do this!

Page 25: A Panorama of Natural Language Processing

Digitizing Natural Language
• Need to have some measure of similarity and differences between words
• Vectors can do this!
• We can use vector operations to gauge similarity between words

Page 26: A Panorama of Natural Language Processing

Word Vectors
• 13 million tokens in the English language
• Many words are similar (cat and feline, man and woman, etc.)
• A nice idea: encode word tokens as vectors, i.e. points in some word space with dimension << 13 million

Page 27: A Panorama of Natural Language Processing

One-Hot Vectors
• Express each word as an |V|-dimensional vector with one 1 and the rest 0s, where |V| is the size of our vocabulary
• One-hot vectors for a dictionary would look like:

Page 28: A Panorama of Natural Language Processing

What’s Wrong?
• One-hot vectors are mutually orthogonal, so every pair of words looks equally (dis)similar
• But some words are similar!
• A nicer idea: reduce the size of the space from |V| to a smaller-dimensional subspace that encodes relationships between words

Page 29: A Panorama of Natural Language Processing

Quick Aside: Singular Value Decomposition (SVD)
• X = U S V^T, where X is (m x n), U is the (m x m) matrix of left singular vectors, S is the (m x n) diagonal matrix with the singular values of X on its diagonal, and V is the (n x n) matrix of right singular vectors
• Take-away point: truncating the SVD of X to its top k singular values gives the best rank-k approximation of X
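A quick numerical illustration of that claim, as a sketch with NumPy (the matrix is random and purely illustrative):

    import numpy as np

    m, n, k = 6, 4, 2
    X = np.random.rand(m, n)
    U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: (m, m), s: singular values, Vt: (n, n)
    # rebuild the best rank-k approximation from the top-k singular triplets
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.linalg.norm(X - X_k))  # residual error of the best rank-k approximation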

Page 30: A Panorama of Natural Language Processing

Illustration of the SVD as a Rank-k Approximation

Page 31: A Panorama of Natural Language Processing

SVD-Based Methods: Window-Based Co-occurrence Matrix
• Only count the number of times a word appears inside a window of a particular size around the word of interest
• Consider the following three documents (window size = 1):
  • I enjoy flying
  • I like NLP
  • I like deep learning
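As a sketch, the window-1 co-occurrence counts for these three documents can be accumulated like this (a toy illustration; the variable names are my own):

    from collections import defaultdict

    docs = ["I enjoy flying", "I like NLP", "I like deep learning"]
    window = 1
    counts = defaultdict(int)
    for doc in docs:
        tokens = doc.split()
        for i, w in enumerate(tokens):
            # count neighbors within the window on each side of position i
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(w, tokens[j])] += 1
    print(counts[("I", "like")])  # 2: "I like" appears in two documents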

Page 32: A Panorama of Natural Language Processing

Applying the SVD to the co-occurrence matrix
• Compute X = U S V^T
• Truncate S at some index k based on the amount of variance captured (e.g. keep enough singular values that their sum is a large fraction of the total)
• Take the sub-matrix U[1:|V|, 1:k] to be our word embeddings
• We now have a k-dimensional representation of every word in our vocabulary!
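Continuing the co-occurrence sketch from the previous slide (reusing its docs and counts; k and the vocabulary ordering are arbitrary choices here):

    import numpy as np

    vocab = sorted({w for doc in docs for w in doc.split()})          # reuses docs from the sketch above
    X = np.array([[counts[(a, b)] for b in vocab] for a in vocab])    # |V| x |V| co-occurrence matrix
    U, s, Vt = np.linalg.svd(X)
    k = 2                                                             # keep enough singular values to capture most variance
    embeddings = U[:, :k]                                             # row i is the k-dimensional vector for vocab[i]
    print(dict(zip(vocab, embeddings.round(2))))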

Page 33: A Panorama of Natural Language Processing

Downfalls of SVD-Based Methods
• The co-occurrence matrix is high dimensional and sparse
• SVDs are computationally expensive (quadratic cost)
• The dimensions of the co-occurrence matrix are constantly changing as new words are added

Page 34: A Panorama of Natural Language Processing

Downfalls of SVD-Based Methods
• The co-occurrence matrix is high dimensional and sparse
• SVDs are computationally expensive (quadratic cost)
• The dimensions of the co-occurrence matrix are constantly changing as new words are added
• Solution: iteration-based methods!

Page 35: A Panorama of Natural Language Processing

Iteration-based Methods

• Word vectors and word embeddings are used to:
  • Find similarity
  • Compute and store representative information about a huge dataset
• Iteration-based methods: create a model that learns one iteration at a time and eventually encodes the probability of a word given its context
• These include basic language models as well as the more advanced word2vec

Page 36: A Panorama of Natural Language Processing

Basic Language Models
• Bag of Words
  • Just count the frequencies of words
  • Issues: high dimension, and word order and relations are lost

Page 37: A Panorama of Natural Language Processing

Basic Language Models
• Bag of Words
  • Just count the frequencies of words
  • Issues: high dimension, and word order and relations are lost
• Term Frequency-Inverse Document Frequency (AKA TF-IDF)
  • How important a word is in a document
  • Used in search engines!

Page 38: A Panorama of Natural Language Processing

Basic Language Models
• Question: are there ways we can maintain information about word order and meaning?
• Bag of Words
  • Just count the frequencies of words
  • Issues: high dimension, and word order and relations are lost
• Term Frequency-Inverse Document Frequency (AKA TF-IDF)
  • How important a word is in a document
  • Used in search engines!
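As a hedged illustration of both models (using scikit-learn, which the slides do not mention, on the three example sentences from earlier):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["I enjoy flying", "I like NLP", "I like deep learning"]
    bow = CountVectorizer().fit_transform(docs)      # bag of words: raw term counts per document
    tfidf = TfidfVectorizer().fit_transform(docs)    # tf-idf: counts reweighted by how rare a term is across documents
    print(bow.toarray())
    print(tfidf.toarray())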

Page 39: A Panorama of Natural Language Processing

Language Models
• Goal: assign a probability to a sequence of tokens
• Consider the two sentences:
  • “The dog wagged his tail”
  • “Puffer fish bank ladder”
• Which should have a higher probability?
• If we assume that word occurrences are independent, the probability of any given sequence of words is just the product of the individual word probabilities (the unigram model):
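In standard notation, the unigram model factorizes the sequence probability as

    P(w_1, w_2, ..., w_n) = P(w_1) · P(w_2) · ... · P(w_n) = ∏_{i=1}^{n} P(w_i)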

Page 40: A Panorama of Natural Language Processing

What if Word Occurrences Are Not Independent?
• Assume the probability of a sequence depends on the pairwise probability of each word in the sequence and the word next to it (bigrams)
• A bigram model is of the form:
• The general N-gram model is given by:
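In standard notation (treating out-of-range words at the start of the sequence as sentence-start padding):

    Bigram:  P(w_1, ..., w_n) ≈ ∏_{i=1}^{n} P(w_i | w_{i-1})
    N-gram:  P(w_1, ..., w_n) ≈ ∏_{i=1}^{n} P(w_i | w_{i-N+1}, ..., w_{i-1})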

Page 41: A Panorama of Natural Language Processing

Approaches So Far
● Simple models trained on huge amounts of data outperform complex models trained on small amounts of data
● Unigrams: each word contributes P(w_i)
● Bigrams: each word contributes P(w_i | w_{i-1})
● N-grams: each word contributes P(w_i | w_{i-N+1}, ..., w_{i-1})

Page 42: A Panorama of Natural Language Processing

Continuous Bag of Words (CBOW)
• Consider part of the sequence of words as context, and try to predict the center word
  • Sentence: “The dog wagged his tail”
  • Context: {“The”, “dog”, “his”, “tail”}
  • Center word: “wagged”
• Our known parameters are the words of the sentence in question, represented by one-hot vectors
  • Let x(c) denote the context words
  • Let y(c) denote the target word (output)
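In the usual word2vec formulation of this step, each context one-hot vector x(c) is mapped to an embedding, the context embeddings are averaged into a single vector v̂, and a softmax over the scores U·v̂ (one score per vocabulary word) is trained so that its probability mass lands on the one-hot target y(c).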

Page 43: A Panorama of Natural Language Processing

Continuous Bag of Words

Page 44: A Panorama of Natural Language Processing

Skip-grams
• Now, consider the center word as context and try to predict the surrounding words
  • Sentence: “The dog wagged his tail”
  • Context: “wagged”
  • Surrounding words: {“The”, “dog”, “his”, “tail”}
• Nearly identical set-up to CBOW, except we switch our x and y
  • Input: one-hot vector (context word)
  • Output: vectors describing the surrounding words

Page 45: A Panorama of Natural Language Processing

Skip-grams

Page 46: A Panorama of Natural Language Processing

Recap
● We first tried condensing language into word vectors
  ○ We want to keep meaning in a lower dimension
  ○ One-hot vectors, SVD… These are expensive!

Page 47: A Panorama of Natural Language Processing

Recap
● We first tried condensing language into word vectors
  ○ We want to keep meaning in a lower dimension
  ○ One-hot vectors, SVD… These are expensive!
● We then tried iteration-based methods
  ○ Language models: Bag of Words, TF-IDF… no context!

Page 48: A Panorama of Natural Language Processing

Recap
● We first tried condensing language into word vectors
  ○ We want to keep meaning in a lower dimension
  ○ One-hot vectors, SVD… These are expensive!
● We then tried iteration-based methods
  ○ Language models: Bag of Words, TF-IDF… no context!
● We add in context with N-gram models

Page 49: A Panorama of Natural Language Processing

Recap
● We first tried condensing language into word vectors
  ○ We want to keep meaning in a lower dimension
  ○ One-hot vectors, SVD… These are expensive!
● We then tried iteration-based methods
  ○ Language models: Bag of Words, TF-IDF… no context!
● We add in context with N-gram models
● We extended these with CBOW and Skip-grams
  ○ CBOW: predict the center word
  ○ Skip-grams: predict the surrounding words

Page 50: A Panorama of Natural Language Processing

Learning word-sequence probabilities
• So far, we have an expression for the chance that a sequence of words appears, as a product of conditional probabilities
• Now, we’d like models that can learn the probabilities of word sequences
• Solution: word2vec (Mikolov et al., 2013)

Page 51: A Panorama of Natural Language Processing

Word2vec
• A neural network implementation that learns distributed representations for words
• 2 algorithms:
  • Continuous bag of words (CBOW)
  • Skip-grams
• 2 training methods:
  • Negative sampling
  • Hierarchical softmax
• Best part: DOES NOT NEED LABELED DATA!

Page 52: A Panorama of Natural Language Processing

Word2vec
• Many of those steps are complicated...
• Luckily, someone made software that does this for us
• Gensim is a Python package that can do all of this complicated word2vec work in a few lines of code
• Result: training high-dimensional word vectors on a large amount of data captures “subtle semantic relationships between words”
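A minimal sketch of that gensim workflow (parameter names follow gensim 4.x, and the tiny corpus is hypothetical, so treat the exact arguments as assumptions):

    from gensim.models import Word2Vec

    sentences = [["the", "dog", "wagged", "his", "tail"],
                 ["i", "like", "nlp"],
                 ["i", "like", "deep", "learning"]]       # toy corpus; real training needs far more text
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                     sg=1,          # sg=1: skip-gram, sg=0: CBOW
                     negative=5)    # negative sampling; hs=1 would use hierarchical softmax instead
    print(model.wv.most_similar("nlp", topn=3))
    # with well-trained vectors, analogies work too:
    # model.wv.most_similar(positive=["king", "woman"], negative=["man"])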

Page 53: A Panorama of Natural Language Processing

Reflecting on word2vec
• Words with similar meanings occur in clusters
• Clusters are spaced such that some word relationships (such as analogies) can be reproduced with vector math
• Famous example (with highly trained word vectors):
  • “king” - “man” + “woman” = “queen”
• Useful feature: word2vec does not require labeled data
  • Most data in the world is unlabeled!
• Word embeddings are very useful for prediction and translation tasks, as well as sentiment analysis

Page 54: A Panorama of Natural Language Processing

Overview
• Background

• Grammars

• Word Representation

• Modern NLP

• Future Directions

• NLP in Industry

• Demos

Page 55: A Panorama of Natural Language Processing

Modern NLP
• Advances in NLP were largely driven by:
  • A vast increase in computing power
  • A better understanding of human language
  • Development of successful ML algorithms
  • Big data
• Much of current work involves:
  • Machine translation
  • Spoken dialogue and conversational agents
  • Machine reading
  • Mining social media
  • Analysis and generation of speaker state

Page 56: A Panorama of Natural Language Processing

Forms of NLP Data
• User data
• Corpora
• Dictionaries
• Ontologies and databases

Page 57: A Panorama of Natural Language Processing

NLP Data Sources
• Wikimedia

• APIs: Twitter, Wordnik, …

• Common crawl

• Wordnet

• Linguistic data consortium (www.ldc.upenn.edu)

• University sites and the academic community

• Stanford, Oxford, CMU

• Create your own!

• Web-scrape, crowd-source, linguists

Page 58: A Panorama of Natural Language Processing

Deep Learning vs. Non-Deep-Learning Methods
• Bag of words may outperform deep learning models on modest-sized datasets
• Word2vec sees a drastic improvement with a LOT of text
• In the literature, distributed word vector techniques outperform bag-of-words models
• Deep learning tries to capture the recursive nature of natural language

Page 59: A Panorama of Natural Language Processing

Deep Learning for NLP
• Deep learning attempts to learn multiple levels of representation of increasing complexity and abstraction
• We want computers to be able to understand the recursive nature of human language
  • Recursive/recurrent neural networks!
• DL models can be fast ways to solve NLP tasks

Page 60: A Panorama of Natural Language Processing

Recurrent Neural Networks
● Recurrent Neural Network (RNN)
  ○ Connections between units form directed cycles
  ○ The internal state of the network allows it to exhibit dynamic temporal behavior
  ○ Success in speech recognition, natural language, translation, etc.
● Long Short-Term Memory (LSTM)

Page 61: A Panorama of Natural Language Processing

Recurrent Neural Network

Page 62: A Panorama of Natural Language Processing

seq2seq
● RNNs applied to sequences
  ○ Generate a response based on meaningful input
  ○ For example, translate from English to French
● Two RNNs: an encoder that processes the input and a decoder that generates the output (see the sketch below)
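A rough sketch of that encoder-decoder pairing (assuming PyTorch, which the slides do not prescribe; sizes and names are illustrative):

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, vocab_size, hidden_size):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

        def forward(self, src):                 # src: (batch, src_len) token ids
            _, hidden = self.rnn(self.embed(src))
            return hidden                       # final hidden state summarizes the input sequence

    class Decoder(nn.Module):
        def __init__(self, vocab_size, hidden_size):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
            self.out = nn.Linear(hidden_size, vocab_size)

        def forward(self, tgt, hidden):         # tgt: (batch, tgt_len) shifted target ids
            output, hidden = self.rnn(self.embed(tgt), hidden)
            return self.out(output), hidden     # per-step scores over the output vocabulary

    # usage: feed the encoder's final state to the decoder as its initial state
    enc, dec = Encoder(1000, 128), Decoder(1200, 128)
    state = enc(torch.randint(0, 1000, (2, 7)))
    logits, _ = dec(torch.randint(0, 1200, (2, 5)), state)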

Page 63: A Panorama of Natural Language Processing

Recursive Deep Learning
• Compositional vector grammars (parsing)

• Recursive autoencoders (paraphrase detection)

• Matrix-vector RNNs (relation classification)

• Recursive neural tensor networks (sentiment analysis)

Page 64: A Panorama of Natural Language Processing

What’s at UC Berkeley?
• Berkeley NLP research: Dan Klein
• Computer vision: Alexei Efros, Jitendra Malik
• CV + NLP: Visual Question Answering

Page 65: A Panorama of Natural Language Processing

Overview
• Background

• Grammars

• Word Representation

• Modern NLP

• Future Directions

• NLP in Industry

• Demos

Page 66: A Panorama of Natural Language Processing

What the (near) future holds

Bots! Think Siri, but actually functional instead of a toy.

Page 67: A Panorama of Natural Language Processing

What the (near) future holds
Supporting invisible UI!

The concept of invisible or zero user interaction between user and machine

Page 68: A Panorama of Natural Language Processing

What the (near) future holds
Smarter search!
The same capabilities that allow a chatbot to understand a customer’s request can enable “search like you talk” functionality.

Page 69: A Panorama of Natural Language Processing

What the (near) future holds
Intelligence from unstructured information!
Analysis that accurately understands the subtleties of natural language (choice of words, tone, etc.) can provide useful knowledge and insight from unstructured information.

Page 70: A Panorama of Natural Language Processing

Overview
• Background

• Grammars

• Word Representation

• Modern NLP

• Future Directions

• NLP in Industry

• Demos

Page 71: A Panorama of Natural Language Processing

NLP in Industry

Page 72: A Panorama of Natural Language Processing
Page 73: A Panorama of Natural Language Processing

NLP Architectures
• Layered model
  • Preprocessing
  • Low-level analysis
  • Semantic analysis
  • Conversion to end products
• Input/output as API structure

Page 74: A Panorama of Natural Language Processing

NLP at Scale
• Systems come before algorithms

• Objective functions are messy

• Everything is changing

• Understanding-optimization trade-off


Page 76: A Panorama of Natural Language Processing

Developing an NLP system

1. Exploration
   a. Translate real-world requirements into a measurable goal
   b. Find an appropriate level and representation
   c. Find data for experiments

Page 77: A Panorama of Natural Language Processing

Developing an NLP system

1. Exploration
   a. Translate real-world requirements into a measurable goal
   b. Find an appropriate level and representation
   c. Find data for experiments
2. Development
   a. Find and utilize existing tools and frameworks
   b. Set up and perform a series of experiments

Page 78: A Panorama of Natural Language Processing

Developing an NLP system

1. Exploration
   a. Translate real-world requirements into a measurable goal
   b. Find an appropriate level and representation
   c. Find data for experiments
2. Development
   a. Find and utilize existing tools and frameworks
   b. Set up and perform a series of experiments
3. Production
   a. CPU/GPU intensive
   b. Most NLP frameworks are not production-ready
   c. Pre- and post-processing is invaluable
   d. Collect user feedback

Page 79: A Panorama of Natural Language Processing

I Have the Model… Now What?
1. Specify Performance Requirements
2. Separate Prediction Algorithm From Model Coefficients
   a. Select or Implement the Prediction Algorithm
   b. Serialize Your Model Coefficients
3. Develop Automated Tests For Your Model
4. Develop Back-Testing and Now-Testing Infrastructure
5. Challenge Then Trial Model Updates

Page 80: A Panorama of Natural Language Processing

Tips for NLP
• Proper preprocessing is VERY important

• Know your domain!

• Validate your models!

• Human judges

• Cross-validation

Page 81: A Panorama of Natural Language Processing

Overview
• Background

• Grammars

• Word Representation

• Modern NLP

• Future Directions

• NLP in Industry

• Demo

Page 82: A Panorama of Natural Language Processing

Programming Language Identification

Exploring code on GitHub

Goal is to figure out what language a file uses

Potential Methods?

Filename, keywords, comments

Whitespace, syntax

Must be scalable to handle a large number of constantly updating repos

Page 83: A Panorama of Natural Language Processing

Existing Model

Linguist: Heuristics + Naive Bayes

Heuristics can be accurate but require updating and fine-tuning

Naive Bayes depends on word frequencies; prediction cost is linear in vocabulary size (see the toy sketch below)

Hard-coded rules do most of the work, leaving the Naive Bayes as a last resort

Heavily dependent on file extension classification

Selective Classification

With file extensions, only 87% of files are classified
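As a toy illustration of the Naive Bayes half of this approach (a scikit-learn sketch, not the actual Linguist implementation; the snippets and labels below are made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    snippets = ["def main():\n    print('hi')",
                "public static void main(String[] args) { }",
                "#include <stdio.h>\nint main(void) { return 0; }"]
    labels = ["Python", "Java", "C"]

    # character n-grams pick up keywords, punctuation, and whitespace style
    clf = make_pipeline(CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                        MultinomialNB())
    clf.fit(snippets, labels)
    print(clf.predict(['System.out.println("hi");']))   # classifies by character n-gram overlap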

Page 84: A Panorama of Natural Language Processing
Page 85: A Panorama of Natural Language Processing
Page 86: A Panorama of Natural Language Processing

Thank you!

[email protected]

Special thanks to Jordan Prosky

Page 87: A Panorama of Natural Language Processing

Appendix

Page 88: A Panorama of Natural Language Processing

Appendix
• Language is meant to convey meaning, which we have a natural way of encoding
  • Children learn this very fast!
  • Hard for computers to learn…
• Language is a symbolic signaling system
  • Example: “pen” (a writing instrument, or an enclosure?)
• Other subtleties: sarcasm, expressive signaling, …

Page 89: A Panorama of Natural Language Processing

What makes NLP difficult?
• Language is meant to convey meaning, which we have a natural way of encoding
  • Children learn this very fast!
  • Hard for computers to learn…
• Language is a symbolic signaling system
  • Example: “pen” (a writing instrument, or an enclosure?)
• Other subtleties: sarcasm, expressive signaling, …

Page 90: A Panorama of Natural Language Processing

Basics of NLP data preprocessing
• Domain specific!
• Tokenization (see the sketch after this list)
  • Example: “This is a test that isn’t so simple”
  • Tokens: “This”, “is”, “a”, “test”, “that”, “is”, “n’t”, “so”, “simple”
• Regular expressions
• Stemming
• Lower-casing
• Removing/adding punctuation
• Other…
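A minimal regex sketch that reproduces the clitic split shown above (handling straight and curly apostrophes; an illustration, not a production tokenizer):

    import re

    def tokenize(text):
        # split off the clitic n't, then pick up word tokens and punctuation separately
        text = re.sub(r"n[’']t\b", " n't", text)
        return re.findall(r"n't|\w+|[^\w\s]", text)

    print(tokenize("This is a test that isn’t so simple"))
    # ['This', 'is', 'a', 'test', 'that', 'is', "n't", 'so', 'simple']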

Page 91: A Panorama of Natural Language Processing

SVD-Based Methods
• Loop over a massive dataset to accumulate word co-occurrence counts in some matrix X
• Perform the SVD on X to get U, S, and V
• Use the rows of U as the word embeddings for all words in your dictionary
• X = ?

Page 92: A Panorama of Natural Language Processing

SVD-Based Methods: Word-Document Matrix
• Assumption: related words often appear in the same document
• Loop over many documents and, every time word i appears in document j, add one to entry X_ij
• Very high dimensional; let’s try something better

Page 93: A Panorama of Natural Language Processing

We are not quite done…
• Need to find suitable U and V matrices!
• Two algorithms help us get what we want:
  • Hierarchical softmax
  • Negative sampling
• These are complicated!
• Luckily, someone made software that does this for us
• Gensim is a Python package that can do all of this complicated word2vec work in a few lines of code