
DAY 2: NLU PIPELINE AND TOOLS

Mark Granroth-Wilding

1

NLP PIPELINE: REMINDER

• Complete NLU too complex

• Break into manageable subtasks

• Develop separate tools

• Different applications, different combinations

• Reuse effort on individual tools

• Lot of research effort on subtasks

• Often, tools & data public

[Pipeline diagram: Web text → Low-level processing → Annotated text → Abstract processing → Further annotated text → ... → Structured knowledge base]

4

NLP PIPELINE: REMINDER

• Advantages:
  • Reuse of tools
  • Common work on subtasks, e.g. parsing
  • Evaluation of components
  • Easy combinations for many applications

• Disadvantages of pipeline:
  • Discrete stages: no feedback
  • Improvements on sub-tasks might not benefit pipelines

[Pipeline diagram: Web text → Low-level processing → Annotated text → Abstract processing → Further annotated text → ... → Structured knowledge base]

5

NLG PIPELINE

• Natural Language Generation can also use pipeline

• Same reasons, same drawbacks

• Not so standardized

• Far fewer tools for sub-tasks

More on the NLG pipeline on day 6

6

REUSABLE COMPONENTS

• Defining standard sub-tasks → can reuse models and tools across pipelines

• Improvements on sub-tasks benefit many applications

• Publicly available code/tools/models. E.g.:

Does: tokenization, POS tagging, named-entity recognition, ...

Used by:
  • Adam: question answering
  • Dragonfire: virtual assistant
  • EpiTator: infectious disease tracker

7

DATA IN THE PIPELINE

• Data passed between components varies greatly

• Components perform analysis of input

• Output annotations
  • word
  • sentence
  • discourse
  • document

8

LEVELS OF ANALYSIS

• Sub-word
  • Character (grapheme): A l i c e w a l k s
  • Phoneme, linguistic sound unit: æ l i s w ɔ k s
  • Morpheme, smallest meaningful unit: alice walk -s

• Word: alice walks quickly

• Phrase, clause: alice walks, then she runs

• Sentence, utterance

• Paragraph, section, discourse turn, ...

• Document

• Document collection, corpus

9

TOOLS AND TOOLKITS

• Pipeline allows component reuse

• Tools for subtasks can be shared

• Many toolkits provide standard components

• Compare:
  • accuracy
  • speed
  • pre-trained models (domain, language)

11


EXAMPLE: STANFORD CORENLP

Pre-trained models for: en, de, fr, ar, zh

Components: tokenization, POS tagging, lemmatization, NER, parsing, dependency parsing, coreference, sentiment, ...

Java / command line / API
Open source
(Demo coming up...)

12

EXAMPLE: SPACY

Pre-trained models for: en, de, fr, es, it

Components: tokenization, POS tagging, NER, dependency parsing

Fewer tools, different languages
Python
Open source
Very fast
(See assignments)

13

EXAMPLE: GENSIM

• Specialized tool

• Topic modelling (see later in course)

• Language independent

• Late in pipeline: abstract analysis

• Use other tools/toolkits for earlier stages:
  • tokenization
  • lemmatization
  • etc.

14

EXAMPLE: GENSIM

[Pipeline diagram: Text corpus (documents) → Sentence split → Sentences → Tokenize → Tokens → Lemmatize → Lemmas → Count document words → Bags of words → Train topic model → Trained model parameters]
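As a toy illustration of this chain (not Gensim's actual API), the stages can be sketched in plain Python; the lemma lookup table and the naive splitters are invented stand-ins for real components:

```python
from collections import Counter

# Toy lemma lookup standing in for a real lemmatizer (invented)
LEMMAS = {"walks": "walk", "walked": "walk", "robots": "robot"}

def sentence_split(text):
    # Naive sentence splitter: break on full stops
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(sentence):
    # Naive tokenizer: lowercase, split on whitespace
    return sentence.lower().split()

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

def bag_of_words(document):
    # Run the full chain and count lemma occurrences
    lemmas = []
    for sentence in sentence_split(document):
        lemmas.extend(lemmatize(tokenize(sentence)))
    return Counter(lemmas)

doc = "Alice walks. The robots walked."
print(bag_of_words(doc))  # "walk" counted twice across sentences
```

The bag-of-words counts are exactly what a topic model trainer would consume as input.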

15

NLU SUB-TASKS

• Some typical sub-tasks

• Mostly early in pipeline: “low-level”

• Brief intro: more on some later

1. Speech recognition

2. Text recognition

3. Morphology

4. POS-tagging

5. Parsing

6. Named-entity recognition

7. Semantic role labelling

8. Pragmatics

[Pipeline diagram: Language data → Low-level processing (low-level) → Annotated text → Abstract processing (more abstract) → Further annotated text → ... → Structured knowledge repr. (abstract)]

18

SPEECH RECOGNITION

• Understanding human speech

[Figure: speech audio signal → text, e.g. “Finally a small settlement loomed ahead. It was of the familiar style of toy-building-block architecture affected by the ant-men, and...”]

• NL interfaces

• Noisy: challenges for NLP further on

• Components:
  • Acoustic model: audio → text
  • Language model: expectations about text

More later...

19

TEXT RECOGNITION
Optical character recognition (OCR)

• Understanding printed/written text

[Figure: scanned document → text, e.g. “Finally a small settlement loomed ahead. It was of the familiar style of toy-building-block architecture...”]

• E.g. digitizing libraries

• Huge variation in how letters appear:

• Earlier methods: classify each character image independently

• Recent methods: take context into account

20

TOKENIZATION

• Many methods use word-based analysis

• What is a word?

• Often split text → word (token) sequence: tokenization

First approximation: split on spaces

Arkilu pursed her lips in thought.

Arkilu / pursed / her / lips / in / thought / .

21

Page 3: TOOLS AND TOOLKITS - Courses...REUSABLE COMPONENTS De ning standard sub-tasks ! can reuse models and tools across pipelines Improvements on sub-tasks bene t many applications Publicly

TOKENIZATION

First approximation: split on spaces

Often not good enough:

“Really meaning,” Arkilu interposed, . . .

“ / Really / meaning / , / ” / Arkilu / interposed / , . . .

Some other tricky cases:

black-furred, to-day, N.Y.U., 5,000

Language-specific
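A sketch of a regex-based tokenizer covering some of these cases; the pattern is illustrative, not a production rule, and abbreviations like N.Y.U. would still need extra handling:

```python
import re

# One alternative per token type, tried left to right:
#   numbers with thousands separators ("5,000"),
#   words incl. hyphenated forms and contractions ("black-furred"),
#   any other single punctuation character.
TOKEN = re.compile(r"\d+(?:,\d+)*|\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize('"Really meaning," Arkilu interposed'))
# ['"', 'Really', 'meaning', ',', '"', 'Arkilu', 'interposed']
print(tokenize("black-furred cats, 5,000 of them"))
# ['black-furred', 'cats', ',', '5,000', 'of', 'them']
```

Note the alternation order matters: the number pattern must come before the word pattern so "5,000" is not split at the comma.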

22

MORPHOLOGY

• Analysis of internal structure of words

unhelpfulness → un+help+ful+ness

• Splitting words into stems, affixes, compounds

thunderstorm → thunder+storm

• Useful for:
  • categorization using morphological features: -s → plural
  • text normalization: robots, robot, robot’s → robot
  • generation: robot+plur → robots; man+plur → men

More on specific methods later today and tomorrow.

23

POS TAGGING

• Part of speech: ancient form of shallow grammatical analysis

• Token-level categories

• Distinguish syntactic function of words in broad classes

For the present [noun], we are. . . vs. The present [adjective] situation. . .

Other ambiguity remains: He gave a present [also noun]

Common NLP subtask: part-of-speech (POS) tagging

More on POS-tagging methods and statistical models tomorrow
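A minimal lookup tagger shows why this ambiguity needs context: always picking each word's most frequent tag (the tag inventory and counts below are invented) mis-tags "present" in its adjective reading:

```python
# Hypothetical tag lexicon: each word's possible tags with toy counts
LEXICON = {
    "the": {"DET": 100},
    "present": {"NOUN": 60, "ADJ": 30, "VERB": 10},
    "situation": {"NOUN": 50},
}

def unigram_tag(tokens):
    # Pick each word's most frequent tag, ignoring context entirely
    return [max(LEXICON[t], key=LEXICON[t].get) for t in tokens]

print(unigram_tag(["the", "present", "situation"]))
# ['DET', 'NOUN', 'NOUN'] -- wrong: "present" is an adjective here
```

Sequence models that condition on surrounding words (covered tomorrow) fix exactly this failure mode.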

24

PARSING

• Syntax: models of structures underlying NL sentences

• Captures dependencies between words

• Analysis required to interpret meaning (semantics)

Alice who saw the man, who pushed Bob, who ate the apple, walks quickly

• Parsing: inference of structure underlying sentence

25

PARSING

• Parsing: inference of structure underlying sentence

• Useful for:
  • Modelling human language processing
  • Disambiguation of sentence meaning
  • Structuring statistical models
  • Splitting sentences into meaningful units (chunks)
  • Much more!

More on syntax, parsing and statistical models on day 4.

26

NAMED ENTITY RECOGNITION

• Named entities: references to people, organisations, products, etc.

• Identification important for extracting structured data

• Can link to known entities in knowledge base

• Many practical uses

Example

DARPA hopes the ALIAS programme will produce an automated system that will be cost effective. In addition to the 737 simulator and the Cessna light aircraft, Aurora has also flown a Diamond DA42 light aircraft and a Bell UH-1 helicopter.

NER comes up again on day 8.

27

SEMANTIC ROLE LABELLING

• Semantic roles capture key parts of structure of meaning in a sentence

• Who did what to whom?

Alice saw the man that Bob pushed

Alice is agent of saw
man is patient of saw
man is patient of pushed

• Semantic role labelling: inference of these relationships

• More abstract than syntax

• Less structured than formal semantics

28

PRAGMATICS

• Pragmatics concerns meaning in broader context

• Includes questions of e.g.
  • conversational context
  • speaker’s intent
  • metaphorical meaning
  • background knowledge

• Abstract analysis

• Depends heavily on other types of analysis seen here

• Many unsolved problems

• Some tasks tackled in NLP: late in pipelines

Some aspects of pragmatics covered on day 10.

29


PIPELINE EXAMPLE REVISITED
In small groups

• Repeat exercise from yesterday

• Assume you:
  • are a computer
  • have database of logical/factual world knowledge
  • have lots of rules/statistics about English

• What is involved in:
  • extracting & encoding relevant information?
  • answering the question?

• Formulate as pipeline

• Don’t worry about correct component names!

31

PIPELINE EXAMPLE REVISITED
In small groups

• Assume you:
  • are a computer
  • have database of logical/factual world knowledge
  • have lots of rules/statistics about English

• What is involved in:
  • extracting & encoding relevant information?
  • answering the question?

• Formulate as pipeline

• Don’t worry about correct component names!

A robotic co-pilot developed under DARPA’s ALIAS programme has already flown a light aircraft.

What agency has created a computer that can pilot a plane?

32

PIPELINE EXAMPLE REVISITED
In small groups

• Assume you:
  • are a computer
  • have database of logical/factual world knowledge
  • have lots of rules/statistics about English

• What is involved in:
  • extracting & encoding relevant information?
  • answering the question?

• Let’s see some pipelines!

A robotic co-pilot developed under DARPA’s ALIAS programme has already flown a light aircraft.

What agency has created a computer that can pilot a plane?

33

EXAMPLE PIPELINE
Information Extraction

[Pipeline diagram: Raw text → Sentence segmentation → Sentences → Tokenization → Tokenized sentences → Part-of-speech tagging → POS-tagged sentences → Entity detection → Chunked sentences → Relation detection → Relations]

Text input:

Facebook chairman Mark Zuckerberg was summoned to testify in front of EU lawmakers.

Relation output:

(Mark Zuckerberg, chairman-of, Facebook)
. . .

More on later stages on day 8.

Example from NLTK Book: https://www.nltk.org/book/ch07.html
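The pipeline's stages can be sketched with toy stand-ins for each component; the capitalisation heuristic and the "chairman" relation rule below are invented for illustration, where a real system (e.g. the NLTK one referenced above) would use trained models:

```python
import re

def segment(text):
    # Toy sentence segmenter: split after sentence-final punctuation
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tokenize(sent):
    # Toy tokenizer: words and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", sent)

def detect_entities(tokens):
    # Crude named-entity heuristic: capitalised tokens
    return {t for t in tokens if t[0].isupper()}

def detect_relations(tokens, entities):
    # Toy rule: "<ORG> chairman <First> <Last>" -> (person, chairman-of, org)
    rels = []
    for i, t in enumerate(tokens):
        if (t == "chairman" and i >= 1 and i + 2 < len(tokens)
                and tokens[i - 1] in entities and tokens[i + 1] in entities):
            person = " ".join(tokens[i + 1:i + 3])
            rels.append((person, "chairman-of", tokens[i - 1]))
    return rels

text = "Facebook chairman Mark Zuckerberg was summoned to testify."
for sent in segment(text):
    toks = tokenize(sent)
    print(detect_relations(toks, detect_entities(toks)))
# [('Mark Zuckerberg', 'chairman-of', 'Facebook')]
```

Each stage consumes the previous stage's annotations, which is exactly the pipeline structure in the diagram: errors in early stages (say, a mis-tokenized name) propagate to relation detection.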

34

SOME OTHER TOOLKITS

OpenNLP

Tokenization, POS tagging, lemmatization, parsing, NER, ...
Pretrained models (some components): en, de, es, nl, pt, se.
Java / command line

NLTK

Tokenization, POS tagging, lemmatization, parsing, NER, language modelling, WSD, ...
Some pretrained models, mostly en.
Python. Primarily developed for teaching

36

SOME OTHER TOOLKITS

TextBlob

Tokenization, POS tagging, lemmatization, simple parsing
Models: en, fr, de.
Python. Easy to use

Flair

Tokenization, POS tagging, NER, ...
Pretrained models: mostly en. Some de, fr.
Python. Fast. SOTA for many tasks

37

LIVE DEMO
Stanford CoreNLP

• Online demo with visualization: http://corenlp.run/

• Many tools can be selected

• Some run whole pipelines: e.g. OpenIE (information extraction), sentiment

• Let’s try some examples. . .

38

SUB-TASKS COMING UP

More on some sub-tasks later in course:

• Morphology: today and tomorrow

• POS tagging: day 3

• Syntactic parsing: day 4

• NLG sub-tasks / components: day 6

• Word-level (lexical) meaning: day 7

• Document-level meaning: day 7

• Named-entity recognition: day 8

• Discourse, pragmatics: day 10

39


REGULAR EXPRESSIONS AND FSAs

Coming up today:

• Brief reminder of regular expressions

• Theory and notation for finite-state automata

• Introduction to morphological analysis

Groundwork for tomorrow, when we put these things together

42

REGULAR EXPRESSIONS
Brief reminder

Pattern              Matches
/radio/              ‘It is the radio. Know then, O Queen
/[Rr]adio/           late that night, the Radio Man
/[sbt]ack/           Crota was already back in the fray
/[0-9]/              showed the time to be 1025;
/radio sets?/        powerful radio sets invented by the beast
                     returning to the hidden radio set, whence
/([Dd]it|[Dd]ah)/    ‘Dah-dit-dah-dit dah-dah-dit-dah.

43

REGULAR EXPRESSIONS
Brief reminder

Pattern                 Matches
/final*/                we finally restored it,
/final+/                we finally restored it,
/radio./                conventional radioese, I repeated
/.(it|ah)(-.(it|ah))*/  ‘Dah-dit-dah-dit dah-dah-dit-dah.

• Need more of a reminder? Jurafsky & Martin 3, 2.1

• Some common regex features not technically regular, e.g. memory (backreferences)
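The slide patterns behave the same way in Python's re module; the last search uses a backreference, one of the "memory" features that goes beyond regular languages:

```python
import re

# Character classes and optionality, as in the tables above
assert re.search(r"[Rr]adio", "late that night, the Radio Man")
assert re.search(r"[sbt]ack", "Crota was already back in the fray")
assert re.search(r"radio sets?", "powerful radio sets invented by the beast")

# Greedy repetition: /final+/ grabs both l's in "finally"
assert re.search(r"final+", "we finally restored it").group() == "finall"

# Wildcard + alternation + repetition
m = re.search(r".(it|ah)(-.(it|ah))*", "Dah-dit-dah-dit dah-dah-dit-dah.")
print(m.group())  # Dah-dit-dah-dit

# Backreference \1 requires "memory" of what group 1 matched:
# this is NOT expressible by any finite-state automaton in general
assert re.search(r"(dah)-\1", "dah-dah-dit").group() == "dah-dah"
```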

44

REGULAR LANGUAGES

• Regular expression defines a regular language

• Set of strings accepted by regex

L(r) = { s | accept(r, s) }

• Language is regular iff ∃ regex for it

• Regular languages describe some aspects of NL

• Useful tool for some NLP

• Particularly: preprocessing & early stages of analysis

• See limitations tomorrow

45

FINITE-STATE AUTOMATA

• Another string-testing formalism: finite-state automaton (FSA)

• Process string by transitioning between states

[FSA diagram: states 1, 2, 3 (3 accepting); from 1, ‘a’ leads to 2; from 2, ‘a’ loops back to 2 and ‘h’ leads to 3]

{ ‘ah’, ‘aah’, ‘aaah’, ... }

• Equivalent regex: /a+h/
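This automaton can be encoded directly as a transition table and simulated in a few lines (a minimal sketch):

```python
# FSA for /a+h/: states 1, 2, 3; accepting state 3
# Transitions: 1 --a--> 2, 2 --a--> 2, 2 --h--> 3
TRANSITIONS = {(1, "a"): 2, (2, "a"): 2, (2, "h"): 3}
ACCEPT = {3}

def accepts(string, start=1):
    state = start
    for ch in string:
        # A missing entry means there is no edge: reject immediately
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPT

print([s for s in ["ah", "aah", "aaah", "h", "aha"] if accepts(s)])
# ['ah', 'aah', 'aaah']
```

Extending the table with new edges (e.g. the ‘,’ transition on the next slide) changes the accepted language without touching the simulator.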

46

ANOTHER FSA

[FSA diagram: as before, plus ‘, ’ leading from 3 back to 1]

{ ‘ah, ah’, ‘ah, aah’, ‘aaaaah, ah, aah’, ... }

• Equivalent regex: /a+h(, a+h)*/

47

FSAs & REGEXES

• Acceptance by FSA ≡ acceptance by regex

• Every FSA has equivalent regex

[Diagram: regular languages = FSAs = regular expressions (= regular grammars)]

48

FINITE STATE TRANSDUCERS

• Extend FSA to output something: not just YES/NO

• Each accepting edge also outputs

• Finite-state transducer

• Same strings/language as FSA

• Output may be:
  • translation
  • analysis
  • spelling transformation
  • . . .

49


FSA → FST

[FSA diagram: states 1, 2; from 1, ‘dit’ and ‘dah’ lead to 2; from 2, ‘-’ and ‘, ’ lead back to 1]

{ ‘dit-dah-dah, dah-dit’, ... }

/(dit|dah)(-(dit|dah))*(, (dit|dah)(-(dit|dah))*)*/

50

FSA → FST

[FST diagram: states 1, 2; edges 1 —dit : 0→ 2, 1 —dah : 1→ 2, 2 —- : ε→ 1, 2 —, : /→ 1]

• Each edge: input : output

• Translates

dit-dah-dah, dah-dit, dit ⇒ 011/10/0

• Common use in NLP: analysis of internal word structure → morphology
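A sketch of this transducer as a symbol-to-output table; with ‘-’ mapped to ε (the empty string), the hyphens disappear from the output while ‘, ’ becomes ‘/’:

```python
# Edge table: each input symbol maps to an output string ("" = ε)
EDGES = {"dit": "0", "dah": "1", "-": "", ", ": "/"}

def transduce(signal):
    # Greedily consume the longest matching input symbol at each step
    out, i = [], 0
    while i < len(signal):
        for sym in sorted(EDGES, key=len, reverse=True):
            if signal.startswith(sym, i):
                out.append(EDGES[sym])
                i += len(sym)
                break
        else:
            raise ValueError(f"no edge matches at position {i}")
    return "".join(out)

print(transduce("dit-dah-dah, dah-dit, dit"))  # 011/10/0
```

The same machinery, with morphemes instead of dits and dahs, is what morphological analyzers build on.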

51

MORPHOLOGY: SOME CONCEPTS

• Morpheme: smallest grammatical unit in language

• Affix: morpheme that occurs only together with others

• Word = stem [ + affixes ]

radios → radio + s

• Compound: word with multiple stems

thunderstorms → thunder + storm + s

52

MORPHOLOGY: SOME CONCEPTS

Types of affix:

• prefix: un+help+ful

• suffix: taste+ful, taste+ful+ness

• infix: internal affix (e.g. Arabic)

• circumfix: prefix & suffix. E.g. German: ge+kauf+t

53

MORPHOLOGY: SOME CONCEPTS

Two types of morphology:

1. Inflectional: regular patterns for word classes
   • Changes grammatical roles
   • E.g. noun cases: kauppa, kauppa+a, kaupa+n, ...

2. Derivational: creates new words
   • Changes meaning
   • E.g. diminutive suffix: tuuli → tuulo+nen

54

MORPHOLOGICAL AMBIGUITY

• Morpheme-level:
  • Morphemes can have multiple interpretations/uses
  • table: noun or verb
  • +s: plural noun or 3rd-person singular verb

• Structural:
  • Words may be split in multiple ways
  • unionised → union+ise+ed / un+ion+ise+ed

• Bracketing:
  • Same split, different bracketing structures
  • unlockable → (un+lock)+able / un+(lock+able)
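A toy recursive segmenter over a small invented morpheme lexicon reproduces the structural ambiguity; at the string level the surface ‘d’ stands for the past-tense affix after a spelling rule, and the lexicon itself is made up for illustration:

```python
# Invented morpheme lexicon for this one example
MORPHEMES = {"un", "union", "ion", "ise", "d"}

def segmentations(word):
    # Enumerate every way to split `word` into known morphemes
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in MORPHEMES:
            for rest in segmentations(word[i:]):
                results.append([prefix] + rest)
    return results

print(segmentations("unionised"))
# [['un', 'ion', 'ise', 'd'], ['union', 'ise', 'd']]
```

Both analyses are well-formed over this lexicon; choosing between them needs context, just like the tagging ambiguity earlier.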

55

USES IN NLP

A few uses of morphology in NLP:

• Morphosyntactic categorization (rough POS tagging)

• Morphological features

• Stemming/lemmatization

• Generation: apply syntax, features, agreement to base forms

the/det loyal/adj princes/noun occupied/verb the/det throne/noun in/prep his/det absence/noun
(some tokens have alternative tags, e.g. his/prn)

56

USES IN NLP

A few uses of morphology in NLP:

• Morphosyntactic categorization (rough POS tagging)

• Morphological features

• Stemming/lemmatization

• Generation: apply syntax, features, agreement to base forms

the/def-sg|def-pl loyal princes/pl occupied/past|pst-prt the/def-sg|def-pl throne/sg in his/m-sg absence/sg

56


USES IN NLP

A few uses of morphology in NLP:

• Morphosyntactic categorization (rough POS tagging)

• Morphological features

• Stemming/lemmatization

• Generation: apply syntax, features, agreement to base forms

the loyal princes occupied the throne in his absence
→ the loyal prince occupy the throne in his absence
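A minimal rule-plus-exception lemmatizer reproduces this mapping; the rules, the irregular table, and the closed-class list are illustrative only:

```python
# Irregular forms listed explicitly; closed-class words left alone
IRREGULAR = {"men": "man", "occupied": "occupy"}
KEEP = {"his", "this", "is"}

def lemmatize(word):
    if word in KEEP:
        return word
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies"):          # cities -> city
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]              # princes -> prince
    return word

sent = "the loyal princes occupied the throne in his absence".split()
print(" ".join(lemmatize(w) for w in sent))
# the loyal prince occupy the throne in his absence
```

Note how "his" needs the closed-class list: a bare strip-final-s rule would wrongly produce "hi", a small example of why normalization rules interact.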

56

USES IN NLP

A few uses of morphology in NLP:

• Morphosyntactic categorization (rough POS tagging)

• Morphological features

• Stemming/lemmatization

• Generation: apply syntax, features, agreement to base forms

the+pl loyal+pl prince+pl occupy+past the+sg throne+sg in prn+pos+masc+sg absence+sg
→ the loyal princes occupied the throne in his absence
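Generation is the inverse mapping: base form plus features to surface form. A toy sketch with an explicit irregular table (again, the tables and feature names are invented for illustration):

```python
# Irregular surface forms override the default affixation rules
IRREGULAR_PLURAL = {"man": "men"}
IRREGULAR_PAST = {"occupy": "occupied"}

def generate(stem, *features):
    # Apply one inflection feature to a base form
    if "+past" in features:
        return IRREGULAR_PAST.get(stem, stem + "ed")
    if "+pl" in features:
        return IRREGULAR_PLURAL.get(stem, stem + "s")
    return stem

print(generate("prince", "+pl"))    # princes
print(generate("man", "+pl"))       # men
print(generate("occupy", "+past"))  # occupied
```

A real generator would compose such rules with agreement features from syntax; finite-state transducers like the one above are a standard way to implement both directions.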

56

SUMMARY

• NLU typically broken into subtasks

• Pipeline of components

• Complex applications reuse standard tools, models, datasets

• Components annotate linguistic data

• Many levels of analysis

• Comparing ready-made tools for subtasks

• Some typical sub-tasks / pipeline components

• Example pipelines

• Some available toolkits

• Refresher on regular expressions and FSAs

• Intro to morphology

57

READING MATERIAL

Introductions to:

• Speech recognition: J&M2 p319-21

• Morphology: J&M3 2.4.4 (J&M2 p79-80)

• POS tagging: J&M3 8.1, 8.3 (J&M2 p157-8, 167-9)

• Syntax & parsing: J&M3 11.1 (J&M2 p419-20, 461)

• NER: J&M3 17.1 (J&M2 p761-6)

Online introduction to Stanford CoreNLP toolkit

58

NEXT UP

After lunch: Practical assignments in BK107

9:15 – 12:00    Lectures
12:00 – 13:15   Lunch
13:15 – ∼14:00  Introduction
14:00 – 16:00   Practical assignments

• Building an NLP pipeline

• Errors propagating through pipeline

• Comparison of tools

• A complete application

59