Page 1: Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study

Manny Rayner, Geneva University
(joint work with Beth Ann Hockey and Gwen Christian)

Page 2: Structure of talk

Background: Regulus and MedSLT
Grammar-based language models and statistical language models

Page 3: What is MedSLT?

Open Source medical speech translation system for doctor-patient dialogues
Medium vocabulary (400-1500 words)
Grammar-based: uses the Regulus platform
Multilingual: translates through an interlingua

Page 4: MedSLT

Open Source medical speech translator for doctor-patient examinations
Main system is unidirectional (the patient answers non-verbally, e.g. nods or points)
– Also an experimental bidirectional system
Two main purposes
– Potentially useful (could save lives!)
– Vehicle for experimenting with the underlying Regulus spoken dialogue engineering toolkit

Page 5: Regulus: central goals

Reusable grammar-based language models
– Compile into recognisers
Infrastructure for using them in applications
– Speech translation
– Spoken dialogue
Multilingual
Efficient development environment
Open Source

Page 6: The full story…

$25 (paperback edition) from amazon.com

Page 7: What kind of applications?

Grammar-based is
– Good on in-coverage data
– Good for complex, structured utterances
Users need to
– Know what they can say
– Be concerned about accuracy
Good target applications
– Safety-critical
– Medium vocabulary (~200-2000 words)

Page 8: In particular…

Clarissa
– NASA procedure assistant for astronauts
– ~250 word vocabulary, ~75 command types
MedSLT
– Multilingual medical speech translator
– ~400-1000 words, ~30 question types
SDS
– Experimental in-car system from Ford Research
– First prize, Ford internal demo fair, 2007
– ~750 words

Page 9: Key technical ideas

Reusable grammar resources
Use grammars for multiple purposes
– Parsing
– Generation
– Recognition
Appropriate use of statistical methods

Page 10: Reusable grammar resources

Building a good grammar from scratch is very challenging
Need a methodology for rational reuse of existing grammar structure
Use a small corpus of examples to extract structure from a large resource grammar

Page 11: The Regulus picture

[Diagram: the Regulus compilation pipeline. A general unification grammar (UG), together with a lexicon, a training corpus, and operationality criteria, feeds EBL specialization, producing an application-specific UG. The UG-to-CFG compiler turns this into a CFG grammar; the CFG-to-PCFG compiler (using the training corpus) adds weights to give a PCFG grammar; the (P)CFG-to-recogniser compiler produces a Nuance recognizer.]

Page 12: The general English grammar

Loosely based on the SRI Core Language Engine grammar
Compositional semantics (4 different versions)
~200 unification grammar rules
~75 features
Core lexicon, ~450 words
(Also resource grammars for French, Spanish, Catalan, Japanese, Arabic, Finnish, Greek)

Page 13: General grammar → domain-specific grammar

"Macro-rule learning"
Corpus-based process
Remove unused rules and lexicon items
Flatten parsed examples to remove structure
Simpler structure → less ambiguity → smaller search space

Page 14: EBL example (1)

[Parse tree for "when do you get headaches" under the general grammar, with categories UTTERANCE, S, VP, VBAR, NP, NBAR, PP, V, PRO and N.]

Page 15: EBL example (2)

[The same parse tree for "when do you get headaches", repeated as the starting point for the flattening step shown on the next slide.]

Page 16: EBL example (3)

[The flattened parse tree for "when do you get headaches": the intermediate VP and NBAR nodes and one S level have been removed, leaving UTTERANCE, S, VBAR, NP, PP and the lexical categories.]

Main new rules:
S → PP VBAR VBAR NP
NP → N
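To make the flattening step concrete, here is a minimal sketch in Python. The toy tree encoding, the OPERATIONAL set, and the helper names frontier and flattened_rules are all illustrative assumptions, not the actual Regulus EBL machinery.

# A minimal sketch of the tree-flattening step behind "macro-rule learning".
# Tree encoding, OPERATIONAL set and all names are hypothetical.

OPERATIONAL = {"UTTERANCE", "S", "VBAR", "NP", "PP"}

def frontier(children):
    """Highest operational or lexical descendants under a node."""
    out = []
    for cat, sub in children:
        if isinstance(sub, str) or cat in OPERATIONAL:
            out.append(cat)            # cut the tree here
        else:
            out.extend(frontier(sub))  # erase the intermediate node
    return out

def flattened_rules(node, rules=None):
    """Collect one flattened rule for each operational phrase node."""
    if rules is None:
        rules = []
    cat, sub = node
    if isinstance(sub, str):           # lexical leaf, e.g. ("N", "headaches")
        return rules
    if cat in OPERATIONAL:
        rules.append((cat, frontier(sub)))
    for child in sub:
        flattened_rules(child, rules)
    return rules

# Simplified toy tree for "when do you get headaches"; NBAR is not
# operational, so it disappears and NP -> N emerges as a flattened rule.
tree = ("S", [("PP", "when"),
              ("VBAR", [("V", "do")]),
              ("VBAR", [("V", "get")]),
              ("NP", [("NBAR", [("N", "headaches")])])])
print(flattened_rules(tree))
# [('S', ['PP', 'VBAR', 'VBAR', 'NP']), ('VBAR', ['V']), ('VBAR', ['V']), ('NP', ['N'])]

Cutting at operational categories is what removes the intermediate structure, so the specialized grammar has fewer, flatter rules and hence a smaller search space.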

Page 17: Using grammars for multiple purposes

Parsing
– Surface words → logical form
Generation
– Logical form → surface words
Recognition
– Speech → surface words

Page 18: Building a speech translator

Combine Regulus-based components
– Source-language recognizer (speech → words)
– Source-language parser (words → logical form)
– Transfer from source to target, via interlingua (logical form → logical form)
– Target-language generator (logical form → words)
– (3rd party text-to-speech)

Page 19: Adding statistical methods

Two different ways to use statistical methods:
Statistical tuning of grammar
Intelligent help system

Page 20: Impact of statistical tuning

(Regulus book, chapter 11)
Base recogniser
– MedSLT with English recogniser
– Training corpus: 650 utterances
– Vocabulary: 429 surface words
Test data:
– 801 spoken and transcribed utterances

Page 21: Vary vocabulary size

Add lexical items (11 different versions)
Total vocabulary 429-3788 surface words
New vocabulary not used in test data
Expect degradation in performance
– Larger search space
– New possibilities are just a distraction

Page 22: Impact of statistical tuning for different vocabulary sizes

[Chart: semantic error rate (0-25%) for the plain CFG model versus the statistically tuned PCFG model, at vocabulary sizes 429, 1392, 2096, 2698, 3266 and 3788 surface words.]

Page 23: Intelligent help system

Need robustness somewhere
Add a backup statistical recogniser
Use it to advise the user
– Approximate match with in-coverage examples (sketched below)
– Show the user similar things they could say
Original paper: Gorrell, Lewin and Rayner, ICSLP 2002
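A minimal sketch of the matching idea, assuming a simple word-overlap (Jaccard) measure; the actual help system's similarity metric and data structures may well differ, and the suggest helper and example sentences are hypothetical.

# Rank known in-coverage sentences by overlap with the backup SLM's output,
# then show the user the closest ones as things they could say instead.

def suggest(slm_hypothesis, in_coverage_examples, n=3):
    """Return the n in-coverage sentences closest to the SLM output."""
    hyp = set(slm_hypothesis.lower().split())
    def overlap(example):
        ex = set(example.lower().split())
        return len(hyp & ex) / len(hyp | ex)   # Jaccard similarity
    return sorted(in_coverage_examples, key=overlap, reverse=True)[:n]

examples = ["do you get headaches in the morning",
            "is the pain severe",
            "does bright light cause the attacks"]
print(suggest("do your headaches come in the mornings", examples, n=2))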

Page 24: MedSLT experiments

(Chatzichrisafis et al, HLT workshop 2006)
French → English version of the system
Basic questions
– How quickly do novices become experts?
– Can people adapt to limited coverage?
Let subjects use the system several times, and track performance

Page 25: Experimental setup

Subjects
– 8 medical students, no previous knowledge of system
Scenario
– Experimenter simulates headache
– Subject must diagnose it
– 3 sessions, 3 tasks per session
Instruction
– ~20 min instructions & video (headset, push-to-talk)
– All other instruction from help system

Page 26: Results – # Interactions

[Chart: average number of interactions per session, falling from 98.6 in session 1 to 63.4 in session 2 and 53.9 in session 3.]

Page 27: Results – Time/Diagnosis

[Chart: time per diagnosis (0-19 minutes) for diagnoses 1-3 in each of sessions 1-3.]

Page 28: Questionnaire results

I quickly learned how to use the system. 4.4
System response times were generally satisfactory. 4.5
When the system did not understand me, the help system usually showed me another way to ask the question. 4.6
When I knew what I could say, the system usually recognized me correctly. 4.3
I was often unable to ask the questions I wanted. 3.8
I could ask enough questions that I was sure of my diagnosis. 4.3
This system is more effective than non-verbal communication using gestures. 4.3
I would use this system again in a similar situation. 4.1

Page 29: Summary

After 1.5 hours of use, subjects complete the task in an average of 4 minutes
– System implementers average 3 minutes
All coverage learned from the help system
Subjects' impressions very positive

Page 30: A few words about interlingua

Coverage in different languages diverges if left to itself
– Want to enforce uniform coverage
Many-to-many translation
– "N² problem"
Solution: translate through interlingua
– Tight interlingua definition

Page 31: Interlingua grammar

Think of the interlingua as a language
Define it using Regulus
– Mostly for constraining representations
– Also get a surface form
"Semantic grammar"
– Not linguistic, all about domain constraints

Page 32: Example of interlingua

Surface form:
"YN-QUESTION pain become-better sc-when [ you sleep PRESENT ] PRESENT"

Representation:
[[utterance_type, ynq], [symptom, pain], [event, become_better], [tense, present], [sc, when], [clause, [[utterance_type, dcl], [pronoun, you], [action, sleep], [tense, present]]]]

Page 33: Constraints from interlingua

Source-language sentences licensed by the grammar may not produce a valid interlingua
The interlingua can act as a knowledge source to improve language modelling

Page 34: Structure of talk

Background: Regulus and MedSLT
Grammar-based language models and statistical language models

Page 35: Language models

Two kinds of language models
Statistical (SLM)
– Trainable, robust
– Require a lot of corpus data
Grammar-based (GLM)
– Require little corpus data
– Brittle

Page 36: Compromises between SLM and GLM

Put weights on GLM (CFG → PCFG)
– Powerful technique, see earlier
– Doesn't address robustness
Put GLMs inside SLMs (Wang et al, 2002)
Use GLM to generate training data for SLM (Jurafsky et al 1995, Jonson 2005)

Page 37: Generating SLM training data with a GLM

Optimistic view
– Need only a small seed corpus, to build the GLM
– Will be robust, since the final model is an SLM
Pessimistic view
– "Something for nothing"
– Data for the GLM could be used directly to build an SLM
Hard to decide
– Don't know what data went into the GLM
– Often just in the grammar writer's head

Page 38: Regulus permits comparison

Use Regulus to build the GLM
Data-driven process with an explicit corpus
The same corpus can be used to build an SLM
Comparison is meaningful

Page 39: Two ways to build an SLM

Direct
– Seed corpus → SLM
Indirect
– Seed corpus → GLM → corpus → SLM

Page 40: Parameters for indirect method

Size of generated corpus
– Can generate any amount of data
Method of generating corpus
– CFG versus PCFG
Filtering
– Use interlingua to filter the generated corpus

Page 41: CFG versus PCFG generation

CFG
– Use the plain GLM to do random generation
PCFG
– Use the seed corpus to weight GLM rules
– Weights then used in random generation (sketched below)
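A toy illustration of the CFG-versus-PCFG distinction in Python. The grammar, its storage format and the rule weights are all hypothetical; with uniform weights this reduces to plain CFG random generation, while seed-corpus-derived weights give PCFG generation.

import random

# Rules stored as {lhs: [(rhs_tuple, weight), ...]}; weights would come
# from counting rule uses when parsing the seed corpus.
RULES = {
    "S":    [(("do", "NP", "VBAR"), 3.0), (("is", "NP", "ADJ"), 1.0)],
    "NP":   [(("you",), 5.0), (("the", "pain"), 2.0)],
    "VBAR": [(("get", "headaches"), 2.0), (("sleep", "well"), 1.0)],
    "ADJ":  [(("thirsty",), 1.0)],
}

def generate(symbol="S"):
    """Expand symbol top-down, picking each rule with probability
    proportional to its corpus-derived weight."""
    if symbol not in RULES:                      # terminal word
        return [symbol]
    expansions, weights = zip(*RULES[symbol])
    rhs = random.choices(expansions, weights=weights)[0]
    return [word for sym in rhs for word in generate(sym)]

for _ in range(5):
    print(" ".join(generate()))

Because high-weight rules dominate, PCFG sampling concentrates the generated corpus on constructions the seed corpus actually uses, which is why the PCFG-generated examples later in the talk look far more natural than the CFG-generated ones.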

Page 42: Interlingua filtering

Impossible to make the GLM completely tight
Many in-coverage sentences make no sense
Some of these don't produce valid interlingua
Use the interlingua grammar as a filter (sketched below)
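A minimal sketch of the filtering idea over the attribute-value representation shown earlier. The REQUIRED and ALLOWED_VALUES constraints and the well_formed checker are illustrative stand-ins for parsing the interlingua with its Regulus semantic grammar, and to_interlingua is a hypothetical placeholder (not defined here) for the source-language parse plus transfer step.

REQUIRED = {"utterance_type", "tense"}
ALLOWED_VALUES = {
    "utterance_type": {"ynq", "whq", "dcl"},
    "tense": {"present", "past"},
}

def well_formed(avm):
    """Accept an interlingua only if it has the required attributes and
    licensed values; recurse into embedded clauses."""
    attrs = {}
    for key, value in avm:
        if key == "clause":
            if not well_formed(value):
                return False
        else:
            attrs[key] = value
    if not REQUIRED <= attrs.keys():
        return False
    return all(attrs[k] in ok for k, ok in ALLOWED_VALUES.items() if k in attrs)

def keep(sentence):
    """A generated sentence survives the filter only if it maps to a
    well-formed interlingua (to_interlingua is hypothetical)."""
    avm = to_interlingua(sentence)
    return avm is not None and well_formed(avm)

# The representation from the earlier example passes the checks.
example = [["utterance_type", "ynq"], ["symptom", "pain"],
           ["event", "become_better"], ["tense", "present"], ["sc", "when"],
           ["clause", [["utterance_type", "dcl"], ["pronoun", "you"],
                       ["action", "sleep"], ["tense", "present"]]]]
assert well_formed(example)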

Page 43: Example: CFG-generated data

what attacks of them 're your duration all day
have a few sides of the right sides regularly frequently hurt
where 's it increased
what previously helped this headache
have not any often ever helped
are you usually made drowsy at home
what sometimes relieved any gradually during its night
's this severity frequently increased before helping
when are you usually at home
how many kind of changes in temperature help a history

Page 44: Example: PCFG-generated data

does bright light cause the attacks
are there its cigarettes
does a persistent pain last several hours
is your pain usually the same before
were there them when this kind of large meal helped joint pain
do sudden head movements usually help to usually relieve the pain
are you thirsty
does nervousness aggravate light sensitivity
is the pain sometimes in the face
is the pain associated with your headaches

Page 45: Example: PCFG-generated data with interlingua filtering

does a persistent pain last several hours
do sudden head movements usually help to usually relieve the pain
are you thirsty
does nervousness aggravate light sensitivity
is the pain sometimes in the face
have you regularly experienced the pain
do you get the attacks hours
is the headache pain better
are headaches worse
is neck trauma unchanging

Page 46: Experiments

Start with the same English seed corpus
– 948 utterances
Generate GLM recogniser
Generate different types of training corpus
– Train an SLM from each corpus
Compare recognition performance
– Word Error Rate (WER)
– Sentence Error Rate (SER)
McNemar sign test on SER to get significance (sketched below)
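A minimal sketch of the significance test, assuming per-utterance reference transcriptions: score each recogniser's output as a whole-sentence hit or miss, then run a matched-pairs sign test (the exact binomial form of McNemar's test) on the discordant utterances. This is illustrative, not the authors' actual evaluation script.

from math import comb

def sentence_errors(references, hypotheses):
    """True where the recogniser got the whole sentence wrong."""
    return [ref != hyp for ref, hyp in zip(references, hypotheses)]

def sign_test(errors_a, errors_b):
    """Two-sided exact binomial p-value over discordant pairs: utterances
    where exactly one of the two recognisers made a sentence error."""
    a_only = sum(a and not b for a, b in zip(errors_a, errors_b))
    b_only = sum(b and not a for a, b in zip(errors_a, errors_b))
    n, k = a_only + b_only, min(a_only, b_only)
    # P(at most k successes in n fair coin flips), doubled for two sides
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** (n - 1)
    return min(p, 1.0)

Only the discordant pairs carry information: utterances both systems get right (or both get wrong) say nothing about which system is better.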

Page 47: Experiment 1: different methods

Version                Corpus size     WER     SER
GLM                            948  21.96%  50.62%
SLM, seed corpus               948  27.74%  58.40%
SLM, CFG, no filter           4281   49.0%   88.4%
SLM, CFG, filter              4281  44.68%  85.68%
SLM, PCFG, no filter          4281  25.98%  65.31%
SLM, PCFG, filter             4281  25.81%  63.70%

Page 48: Experiment 1: significant differences

GLM >> all SLMs
seed corpus >> all generated corpora
PCFG generation >> CFG generation
filtered > not filtered

However, the generated corpora are small…

Page 49: Experiment 2: different sizes of corpus

Version                Corpus size     WER     SER
GLM                            948  21.96%  50.62%
SLM, seed corpus               948  27.74%  58.40%
SLM, PCFG, no filter        16,619  24.84%  62.47%
SLM, PCFG, filter           16,619  23.80%  59.51%
SLM, PCFG, no filter       497,798  24.38%  59.88%
SLM, PCFG, filter          497,798  23.76%  57.16%

Page 50: Experiment 2: significant differences

GLM >> all SLMs
large corpus > small corpus
large unfiltered generated corpus ~ seed corpus
– SER for large unfiltered corpus about the same
large filtered generated corpus ~/> seed corpus
– SER for large filtered corpus better, but not significant
filtered > not filtered

Page 51: Experiment 3: like 2, but only in-coverage data

Version                Corpus size     WER     SER
GLM                            948   7.00%  22.37%
SLM, seed corpus               948  14.40%  42.02%
SLM, PCFG, no filter        16,619  14.13%  46.11%
SLM, PCFG, filter           16,619  12.76%  40.86%
SLM, PCFG, no filter       497,798  12.35%  40.66%
SLM, PCFG, filter          497,798  11.25%  36.19%

Page 52: Experiment 3: significant differences

GLM >> all SLMs
large corpus > small corpus
large unfiltered generated corpus ~/> seed corpus
– SER for large unfiltered corpus better, not significant
large filtered generated corpus > seed corpus
filtered > not filtered

Page 53: Using GLMs to make SLMs: conclusions

Regulus lets us evaluate fairly
Indirect method for building an SLM only slightly better than the direct one
GLM better than all SLM variants
– Especially clear on in-coverage data
PCFG generation much better than CFG

Page 54: Summary

MedSLT
– Potentially useful tool for doctors in future
– Good test-bed for research now
Using GLMs to build SLMs
– Example of how Regulus lets us evaluate a grammar-based method objectively

Page 55: For more information

Regulus websites:
http://sourceforge.net/projects/regulus/
http://www.issco.unige.ch/projects/regulus/

Rayner, Hockey and Bouillon, "Putting Linguistics Into Speech Recognition" (CSLI Press, June 2006)