
Contextualized Language Processing with Explicit Context Representation

Mari Ostendorf, University of Washington

Thanks to Aaron Jaech and Vicky Zayats… and Google, NSF, DARPA


Language Varies

Language use is highly context dependent (topic, social setting, source, format,...)

Humans seamlessly adapt to changing context

Computational models degrade (and often break) in new contexts; adaptation requires a lot of data



Examples of Contextual Variation

WSJ: Fujitsu Ltd.'s top executive took the unusual step of publicly apologizing for his company's making bids of just one yen for several local government projects, while computer rival NEC Corp. made a written apology for indulging in the same practice.

Switchboard:
A: ok so what do you think
B: well that's a pretty loaded topic
A: absolutely
B: well here in uh hang on just a second the dog is barking ok here in oklahoma we just went through a uh major educational reform…

Blog Post: Interestingly, four Republicans, including the Senate Majority Leader, joined all the Democrats on the losing end of a 17-12 vote. Welcome to the coven of secularists and atheists. Not that this has anything to do with religion, as the upright Senator X swears, … I guess no one told this neanderthal that lying is considered a sin by most religions.

Virtual Assistant:
OK Google, call my husband.
Alexa, set a timer for 10 minutes.


The Context of this Talk

• Context takes many forms, but broadly there are two classes:
  • Language context: what has already been said
  • Situational context: the source, format, location, task, …

• Much “attention” has been given to language context:
  • Contextualized word embeddings (e.g., BERT neural language models)
  • Sequence-to-sequence models for dialog history, document grounding

• Focus of this talk:
  • Observable situational context for neural word sequence models
  • New mechanisms to represent and integrate context



Situational Context & Sequence Models

• Many NLP applications build on sequence models
  • Machine translation, question answering, speech understanding, text generation, summarization, …
  • State-of-the-art systems use neural sequence models

• Situational context is often observable
  • Information available from metadata (global, often categorical)
    • text vs. speech, +/- interactive, source, genre, author role/ID, human vs. computer-directed, …
    • location, date/time, reading level, …
  • Associated audio/video recordings (dynamic, continuous)



Methods to Represent & Integrate Context

• Global context
  • Representation: map a tuple of context variables to an embedding
  • Integration: adjust neural network weight matrices

• Dynamic context
  • Representation: conditional encoding of variable-length context
  • Integration: context + language as a multimodal problem



Global Context

• Domain mismatch is a long-standing challenge
• Training with more data helps, but doesn't solve the problem
  • RoBERTa, trained on 160GB of text, still needs fine-tuning for new domains
  • There are many variants of BERT: bioBERT, sciBERT, clinicalBERT, …
• Can we reduce the amount of domain-specific data needed by explicitly representing context factors in multi-domain training?
• Some prior work: domain is known in training but not testing
• For domains that are known, we can use that information!

[Figure: context = a tuple with multiple factors, mapped (via W) to an embedding c]
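As a concrete illustration of mapping a context tuple to an embedding c, here is a minimal PyTorch sketch; the factor names, sizes, and the tanh-projection combination are illustrative assumptions, not the specific setup used in this work.

```python
# Minimal sketch (illustrative factors and sizes) of mapping a tuple of
# categorical context factors to a single context embedding c.
import torch
import torch.nn as nn

class ContextEmbedder(nn.Module):
    def __init__(self, factor_sizes, dim):
        super().__init__()
        # one embedding table per context factor (e.g., domain, author role, genre)
        self.tables = nn.ModuleList([nn.Embedding(n, dim) for n in factor_sizes])
        self.proj = nn.Linear(len(factor_sizes) * dim, dim)

    def forward(self, factor_ids):
        # factor_ids: LongTensor of shape (batch, num_factors)
        parts = [tab(factor_ids[:, i]) for i, tab in enumerate(self.tables)]
        return torch.tanh(self.proj(torch.cat(parts, dim=-1)))  # context embedding c

# Example: three hypothetical factors with 5, 4, and 10 values; c is 16-dimensional.
embedder = ContextEmbedder([5, 4, 10], dim=16)
c = embedder(torch.tensor([[2, 1, 7]]))  # shape (1, 16)
```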


Controlling for Context

• Map known context into a vector “embedding”
• Use the context embedding c as an auxiliary input with the word sequence, either:
  • concatenate c to all words, or
  • use c to adjust model weights
• Jointly learn word and context mappings together from multi-context data


[Figure: two block diagrams, each mapping w_1:n through a neural network to output y — one concatenates c to all word inputs, the other uses c to adjust the model weights]


RNN LM: Concatenate context


[Figure: RNN cell with inputs e_t (word embedding), h_{t−1} (previous hidden state), and c (context embedding), producing hidden state h_t and output y_t]

Map context to an embedding c, concatenate to input

h_t = σ( [W F] [h_{t−1}, e_t, c] + b_1 ) = σ( W [h_{t−1}, e_t] + F c + b_1 )

A linear shift in the bias!

(e_t is the word embedding, c the context embedding, h_t the hidden state)

Aside: You can also use c in the output layer.

SoftmaxBias (SB): y_t = softmax( V h_t + G c + b_2 )
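A minimal sketch of the two equations above in PyTorch (a plain sigmoid RNN cell with arbitrary sizes, not the authors' implementation): the context term F c only shifts the hidden-layer bias, and the SoftmaxBias variant adds G c in the output layer.

```python
# Sketch of h_t = sigma(W [h_{t-1}, e_t] + F c + b_1) and the SoftmaxBias output
# y_t = softmax(V h_t + G c + b_2). Plain RNN cell, for illustration only.
import torch
import torch.nn as nn

class ConcatContextRNNCell(nn.Module):
    def __init__(self, emb_dim, hid_dim, ctx_dim, vocab_size):
        super().__init__()
        self.W = nn.Linear(hid_dim + emb_dim, hid_dim)        # its bias plays the role of b_1
        self.F = nn.Linear(ctx_dim, hid_dim, bias=False)      # context-to-hidden projection
        self.V = nn.Linear(hid_dim, vocab_size)               # its bias plays the role of b_2
        self.G = nn.Linear(ctx_dim, vocab_size, bias=False)   # SoftmaxBias term

    def forward(self, h_prev, e_t, c):
        # Concatenating c to the input is equivalent to adding F c: a shift of the bias.
        h_t = torch.sigmoid(self.W(torch.cat([h_prev, e_t], dim=-1)) + self.F(c))
        y_t = torch.softmax(self.V(h_t) + self.G(c), dim=-1)
        return h_t, y_t
```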


RNN LM: Use c to Control Recurrent Weights


[Figure: RNN cell as above, but the context embedding c now controls the adapted weight matrix; output layer V produces y_t]

h_t = σ( W′ [h_{t−1}, e_t] + F c + b_1 )

Additive correction: W′ = W_0 + W_A

W_A is a context-controlled mixture:

W_A = Σ_{k=1}^{K} λ(c_k) W_k

(Jaech & Ostendorf, TACL 2018)
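Below is a sketch of the adapted recurrent weights W′ = W_0 + Σ_k λ(c_k) W_k, with λ(c_k) taken to be c_k itself; the shapes, initialization, and choice of λ are assumptions for illustration rather than the paper's code.

```python
# Sketch of W' = W_0 + W_A, with W_A a context-controlled mixture of K basis
# matrices (here lambda(c_k) = c_k); shapes are illustrative.
import torch
import torch.nn as nn

class MixtureAdaptedWeights(nn.Module):
    def __init__(self, in_dim, hid_dim, k):
        super().__init__()
        self.W0 = nn.Parameter(0.01 * torch.randn(in_dim, hid_dim))        # generic weights
        self.bases = nn.Parameter(0.01 * torch.randn(k, in_dim, hid_dim))  # W_1 ... W_K

    def forward(self, c):
        # c: context embedding of shape (k,)
        W_A = torch.einsum('k,kij->ij', c, self.bases)  # sum_k c_k * W_k
        return self.W0 + W_A  # W', used in place of W in the RNN recurrence
```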


FactorCell Model


[Figure: W_0 + W_A = adapted weight matrix, where W_0 ((e+h) × h) is the generic weight matrix and W_A is a low-rank adaptation matrix generated by contracting the context embedding c (1 × k) with basis tensors L (k × (e+h) × r) and R (r × h × k)]

• Basis tensors (L and R) each hold k different rank-r matrices, each the same size as W_0
• Can generate weights for any context c; precomputing gives minimal added cost
• Can also be used for gating weight matrices

λ(c_k) W_k = c_k L_k R_k
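Following the dimension labels in the figure (a sketch under those assumptions, not the released implementation), the adaptation can be generated for any context by contracting c with the left and right basis tensors and multiplying the two small results; doing this once per context keeps the per-token cost essentially unchanged.

```python
# Sketch of FactorCell weight generation: contract c (k,) with L (k x (e+h) x r)
# and R (r x h x k), then multiply the (e+h) x r and r x h results to obtain the
# low-rank adaptation W_A. Names and init are illustrative.
import torch
import torch.nn as nn

class FactorCellWeights(nn.Module):
    def __init__(self, e, h, k, r):
        super().__init__()
        self.W0 = nn.Parameter(0.01 * torch.randn(e + h, h))    # generic weight matrix
        self.L = nn.Parameter(0.01 * torch.randn(k, e + h, r))  # left basis tensor
        self.R = nn.Parameter(0.01 * torch.randn(r, h, k))      # right basis tensor

    def adapted(self, c):
        # c: context embedding of shape (k,)
        left = torch.einsum('k,kir->ir', c, self.L)    # (e+h) x r
        right = torch.einsum('rhk,k->rh', self.R, c)   # r x h
        return self.W0 + left @ right                  # W_0 + W_A; precompute per context
```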


Advantages of the FactorCell Model

§ Rank hyperparameter controls the degree of adaptation – can gracefully scale with different amounts of data

§ Increasing rank is more efficient than increasing state size

§ Works for applications with thousands of different contexts; learns similarity between contexts

§ To add a new context (e.g., a new user): start with a generic context embedding and update it with the other weights fixed (see the sketch below)

§ Method extends to language-based context vectors
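For the new-context case, here is a hedged sketch of learning only the new context embedding with all other weights frozen; the model interface (`model.loss(words, targets, c)`) and data format are hypothetical stand-ins.

```python
# Sketch: adapt to a new context (e.g., a new user) by optimizing only its
# context embedding; every trained model weight stays frozen.
import torch

def learn_new_context_embedding(model, batches, ctx_dim, steps=100, lr=0.1):
    for p in model.parameters():
        p.requires_grad_(False)                        # freeze the trained model
    # start from a generic embedding (zeros here; an average of trained
    # context embeddings is another reasonable choice)
    c_new = torch.zeros(ctx_dim, requires_grad=True)
    opt = torch.optim.Adam([c_new], lr=lr)
    for _, (words, targets) in zip(range(steps), batches):
        opt.zero_grad()
        model.loss(words, targets, c_new).backward()   # hypothetical loss(words, targets, c)
        opt.step()
    return c_new.detach()
```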


Experimental Findings

• Experiments on 9 data sets with:
  • 4-170k categorical contexts + latitude/longitude contexts
  • Word sequences & character sequences
  • LM, classification, generation, query completion

• Observations:
  • Adapting recurrent weights is most useful for cases with a large number of contexts
  • Softmax bias updates are enough for word topic modeling, but useless for character LMs
  • Online learning of context vectors is fast
  • Fine-grained differences observed in generation


(Jaech & Ostendorf, TACL 2018)

(Jaech, Ph.D. thesis, 2018)

(Jaech & Ostendorf, ACL 2018)

Fill in the blank: “This was my first time coming here and the food was _____”.

Generated completions by review rating (FC = FactorCell, SB = SoftmaxBias):

Rating  FC        SB
*****   amazing!  great!
****    great!    great!
***     good!     great!
**      just meh  mediocre
*       awful     mediocre


Dynamic (Physical) Context

• Most prior work has been on language and vision
  • Audio-visual speech recognition
  • Vision-grounded language understanding

• Speech audio carries pitch and timing cues (prosody) that listeners use to extract meaning from speech
  • In all languages, prosody is useful for segmentation, salience & emotion
  • In English, prosody can signal intent, sentiment, uncertainty
  • Prosodic cues are associated with multiple phenomena at different time scales that are hard to disentangle → a representation problem

[Example: “Wanted: Chief Justice of the Massachusetts Supreme Court”, where prosodic phrasing signals the intended grouping of the words]


Prosody + Text: Multimodal Analogies

[Images with captions: “Girl playing frisbee”; “Fruit in a bowl”; “Bananas, apples and a lemon”]

• Caption & image have both complementary and redundant info (as for words & prosody)

• Caption depends on the bounding boxes; sentence meaning depends on emphasis & break location

• Standard multimodal integration issues:
  • How tightly coupled the modalities are
  • Modality representation learning strategy

• For sentence understanding tasks, we use:
  • Word-aligned prosody vectors (see the sketch below)
  • Conditional encodings of prosody

[Photo: Rodney Chen / UltiPhotos.com. Caption: “Hallie Dunham makes a catch as Hallie's dad films in the background atop his famous ladder.” Ultiworld, March 2020]


Language-Normalized Prosody Features

1. Given a text, predict its prosody

2. Compare predicted with true signal: what is the difference relative to the expected variability?

3. Use z_i as the prosody features

[Figure: word sequence “was it I mean did you put …” with predicted prosody p̂, observed prosody p, and innovations z per word]

p̂_{i,k} | h_i ~ N( μ_{i,k}, σ²_{i,k} )

z_{i,k} = ( p_{i,k} − μ_{i,k} ) / σ_{i,k}

Innovations = variation that is not accounted for by the word sequence (i.e. default reading)

(Zayats et al., NAACL 2019)

(p is a vector of prosody features, not the waveform)
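A minimal sketch of the innovations computation above (shapes illustrative): given the text-predicted mean and standard deviation for each word's prosody features, the normalized residual is the feature passed to the model.

```python
# z_{i,k} = (p_{i,k} - mu_{i,k}) / sigma_{i,k}: prosody variation not explained
# by the word sequence. p, mu, sigma: (num_words, num_prosody_features) tensors,
# with mu and sigma coming from a text-conditioned prosody predictor.
import torch

def prosody_innovations(p, mu, sigma, eps=1e-6):
    # eps guards against tiny predicted variances (an added assumption)
    return (p - mu) / (sigma + eps)
```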


Experiments in Disfluency Detection

• Task: Disfluency detection in Switchboard

• Findings:
  • Innovations are almost as good as text alone
  • Innovations (but not raw prosody) are useful when combined with text, more so with late fusion
  • Outperforms parallel CNN encoding of prosody, which does help in parsing
  • Disfluency interruption points: words are longer and lower energy

[Chart: disfluency detection scores (0-100 scale) for Text, Raw Prosody, and Innovations features]

Examples where prosody helps:

but it's just you know leak leak leak everywhere

I mean [ it was + it ]


Summary

• Context/domain mismatch impacts most NLP applications, but often the context is known

• Many types of context can be represented with embeddings
  • Words, metadata, audio/visual

• Standard approach for using context is as an added input vector → a context-dependent additive bias correction

• Other solutions may lead to better results:
  • For sentence-level context: adapt the weight matrix, multi-domain training
  • For dynamic (e.g., audio) context: multimodal integration with conditional encoding to derive context vectors

• Future work: many other contexts & architectures to explore



Thank you!


And thanks again to Aaron Jaech and Vicky Zayats