SYNTAX Matt Post IntroHLT class 21 October 2019


Page 1: SYNTAX - GitHub Pages · 2019-12-05 · LINGUISTIC FUNDAMENTALS FOR NATURAL LANGUAGE PROCESSING MORGAN & CLAYPOOL Linguistic Fundamentals for Natural Language Processing 100 Essentials

SYNTAX

Matt Post IntroHLT class 21 October 2019

Page 2:

Fred Jones was worn out from caring for his often screaming and crying wife during the day but he couldn’t sleep at night for fear that she in a stupor from the drugs that didn’t ease the pain would set the house ablaze with a cigarette

Page 3:

• 46 words, 46! permutations of those words, the vast majority of them ungrammatical and meaningless
• How is it that we can
  – process and understand this sentence?
  – discriminate it from the sea of ungrammatical permutations it floats in?

3
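The scale of that search space is easy to check with the standard library; a two-line sketch:

```python
import math

# 46! = number of orderings of a 46-word sentence; only a vanishingly
# small fraction of those orderings are grammatical English.
n_perms = math.factorial(46)
print(f"{n_perms:.2e}")   # on the order of 10**57
print(len(str(n_perms)))  # 58 digits
```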

Page 4:

Today we will cover

4

what is syntax?
what is a grammar and where do they come from?
how can a computer find a sentence’s structure?
what can parse trees be used for?

Linguistics / Computer Science


Page 6:

Morgan & Claypool Publishers
SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES
Series Editor: Graeme Hirst, University of Toronto
www.morganclaypool.com

About SYNTHESIS: This volume is a printed version of a work that appears in the Synthesis Digital Library of Engineering and Computer Science. Synthesis Lectures provide concise, original presentations of important research and development topics, published quickly, in digital and print formats. For more information visit www.morganclaypool.com

ISBN: 978-1-62705-011-1
Series ISSN: 1947-4040

Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax
Emily M. Bender, University of Washington

Many NLP tasks have at their core a subtask of extracting the dependencies (who did what to whom) from natural language sentences. This task can be understood as the inverse of the problem solved in different ways by diverse human languages, namely, how to indicate the relationship between different parts of a sentence. Understanding how languages solve the problem can be extremely useful in both feature design and error analysis in the application of machine learning to NLP. Likewise, understanding cross-linguistic variation can be important for the design of MT systems and other multilingual applications. The purpose of this book is to present in a succinct and accessible fashion information about the morphological and syntactic structure of human languages that can be useful in creating more linguistically sophisticated, more language-independent, and thus more successful NLP systems.


Page 7:

What is syntax?
• A set of constraints on the possible sentences in the language
  – *A set of constraint on the possible sentence.
  – *Dipanjan asked [a] question.
  – *You are on class.
• A finite set of rules licensing an infinite number of strings

7

Page 8:

What isn’t syntax?
• A “scaffolding for meaning” (Weds), but not the same as meaning

8

grammatical / meaningful      grammatical / meaningless
ungrammatical / meaningful    ungrammatical / meaningless

Page 9:

Parts of Speech (POS)
• Three definitions of noun

9

Grammar school (“metaphysical”): a person, place, thing, or idea


Page 11:

Parts of Speech (POS)
• Three definitions of noun

9

Grammar school (“metaphysical”): a person, place, thing, or idea

Distributional: the set of words that have the same distribution as other nouns
{I, you, he} saw the {bird, cat, dog}.

Functional: the set of words that serve as arguments to verbs
[diagram: a verb with noun, adverb, and adjective dependents]
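The distributional definition can be operationalized directly. A toy sketch (the corpus and the one-word context window are made up for illustration): words that occur in the same left/right contexts as known nouns fall into the same class.

```python
from collections import defaultdict

# Tiny made-up corpus, lowercased and pre-tokenized.
corpus = [
    "i saw the bird", "you saw the cat", "he saw the dog",
    "the cat slept", "the dog slept",
]

# Map each word to the set of (left-neighbor, right-neighbor) contexts
# it appears in; <s> and </s> mark sentence boundaries.
contexts = defaultdict(set)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        left = words[i - 1] if i > 0 else "<s>"
        right = words[i + 1] if i < len(words) - 1 else "</s>"
        contexts[w].add((left, right))

# bird/cat/dog share the context ("the", "</s>"): one distributional class.
shared = contexts["bird"] & contexts["cat"] & contexts["dog"]
print(shared)
```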

Page 12:

POS Examples
• Collapsed form: single POS collects morphological properties (number, gender, case)
  – NN, NNS, NNP, NNPS
  – RB, RBR, RBS, RP
  – VB, VBD, VBG, VBN, VBP, VBZ
• This works fine… in English

10

Page 13:

• Collapsing morphological properties doesn’t work so well in other languages
• Attribute-value: morph. properties factored out
  – Haus: N[case=nom,number=1,gender=neuter]
  – Hauses: N[case=genitive,number=1,gender=neuter]
• In general:
  – Parts of speech are not universal
  – The finer-grained the parts and attributes are, the more language-specific they are
  – Coarse categories will cover more languages

11
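The attribute-value idea maps naturally onto plain dictionaries; a minimal sketch mirroring the Haus/Hauses example above (the `cat` key name is an illustrative choice, not a standard):

```python
# Attribute-value POS: one coarse category plus factored-out
# morphological features, as in the Haus/Hauses example.
haus   = {"cat": "N", "case": "nom",      "number": "1", "gender": "neuter"}
hauses = {"cat": "N", "case": "genitive", "number": "1", "gender": "neuter"}

# A coarse comparison ignores the fine-grained attributes...
same_coarse = haus["cat"] == hauses["cat"]
# ...while the fine-grained view pinpoints exactly where the forms differ.
diff = sorted(k for k in haus if haus[k] != hauses[k])
print(same_coarse, diff)  # True ['case']
```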

Page 14:

12

A Universal Part-of-Speech Tagset

Slav Petrov¹, Dipanjan Das², Ryan McDonald¹
¹Google Research, New York, NY, USA, {slav,ryanmcd}@google.com
²Carnegie Mellon University, Pittsburgh, PA, USA, [email protected]

Abstract

To facilitate future research in unsupervised induction of syntactic structure and to standardize best-practices, we propose a tagset that consists of twelve universal part-of-speech categories. In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set. As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages. We highlight the use of this resource via three experiments, that (1) compare tagging accuracies across languages, (2) present an unsupervised grammar induction approach that does not use gold standard part-of-speech tags, and (3) use the universal tags to transfer dependency parsers between languages, achieving state-of-the-art results.

Keywords: Part-of-Speech Tagging, Multilinguality, Annotation Guidelines


• Two efforts:
  – A Universal Part-of-Speech Tagset, Petrov et al. (LREC 2012): http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf
  – Unimorph: unimorph.org, unimorph.github.io
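For English, the flavor of the Petrov et al. mapping is a many-to-one table from Penn Treebank tags down to the twelve universal categories. The excerpt below is a partial, illustrative reconstruction; the authoritative per-treebank mappings live at the project page referenced in the paper above.

```python
# Partial PTB -> universal tag mapping in the style of Petrov et al. (2012).
# Illustrative excerpt only; see the paper's project page for the full tables.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET", "IN": "ADP", "CD": "NUM", "PRP": "PRON",
}

tagged = [("Pierre", "NNP"), ("will", "MD"), ("join", "VB"), ("the", "DT")]
# Unknown fine-grained tags fall back to the catch-all category X.
coarse = [(w, PTB_TO_UNIVERSAL.get(t, "X")) for w, t in tagged]
print(coarse)  # [('Pierre', 'NOUN'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET')]
```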

Page 15:

Phrases and Constituents
• Longer sequences of words can perform the same function as individual parts of speech:
  – I saw [a kid]
  – I saw [a kid playing basketball]
  – I saw [a kid playing basketball alone on the court]
• This gives rise to the idea of a phrasal constituent, which functions as a unit in relation to the rest of the sentence

13

Page 16:

• Tests (Bender #51)
  – coordination
    ∎ Kim [read a book], [gave it to Sandy], and [left].
  – substitution with a word
    ∎ Kim read [a very interesting book about grammar].
    ∎ Kim read [it].

14

Page 17:

Heads, arguments, & adjuncts
• Head: “the sub-constituent which determines the internal structure and external distribution of the constituent as a whole” (Bender #52)
  – Kim planned [to give Sandy books].
  – *Kim planned [to give Sandy].
  – Kim planned [to give books].
  – *Kim planned [to see Sandy books].
  – Kim [would [give Sandy books]].
  – Pat [helped [Kim give Sandy books]].
  – *[[Give Sandy books] [surprised Kim]].

15

Page 18:

• Dependents of a head:
  – Arguments: selected/licensed by the head and complete the meaning
  – Adjuncts: not selected and refine the meaning
• Examples
  – ADJ
    ∎ Kim is [readyADJ [to make a pizza]V].
    ∎ *Kim is [tiredADJ [to make a pizza]V].
  – N
    ∎ [The [red]ADJ ball]
    ∎ *[The [red]ADJ ball [the stick]N]

16

Page 19:

Formalisms
• Phrase-structure and dependency grammars
  – Phrase-structure: encodes the phrasal components of language
  – Dependency grammars encode the relationships between words

17

Page 20:

• Phrase / constituent structure “Dick Darman, call your office.”

18

Page 21:

• Dependency structure “Dick Darman, call your office.”

19

Page 22:

Summary

20

what is syntax?
A finite set of rules licensing an infinite number of strings

We don’t know the rules, but we know that they exist, and native speaker judgments can be used to empirically explore them

Page 23:

Today we will cover

21

what is syntax?
what is a grammar and where do they come from?
how can a computer find a sentence’s structure?
what can parse trees be used for?

Linguistics / Computer Science

Page 24:

Treebanks
• Collections of natural text that are annotated according to a particular syntactic theory
  – Ideally as large as possible
  – Usually annotated by linguistic experts
  – Theories are usually coarsely divided into constituent or dependency structure

22

Page 25:

Dependency Treebanks

• Dependency trees annotated across languages in a consistent manner

23

https://universaldependencies.org

Page 26:

Penn Treebank (1993)

24

https://catalog.ldc.upenn.edu/LDC99T42

Page 27:

The Penn Treebank
• Syntactic annotation of a million words of the 1989 Wall Street Journal plus other corpora
  – People often discuss “The Penn Treebank” when they mean the WSJ portion of it
• Contains 74 total tags: 36 parts of speech, 7 punctuation tags, and 31 phrasal constituent tags, plus some relation markings
• Was the foundation for an entire field of research and applications for over twenty years

25

Page 28:

( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) )
             (, ,)
             (ADJP (NP (CD 61) (NNS years) ) (JJ old) )
             (, ,) )
     (VP (MD will)
         (VP (VB join)
             (NP (DT the) (NN board) )
             (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
             (NP-TMP (NNP Nov.) (CD 29) )))
     (. .) ))

https://commons.wikimedia.org/wiki/File:PierreVinken.jpg

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
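Trees in this bracketed format are easy to read programmatically. Below is a minimal reader, a sketch rather than an official tool: it handles plain `(LABEL child ...)` groups and does not handle the extra outer parentheses that wrap each tree in the released Treebank files.

```python
# Minimal reader for bracketed trees: returns nested (label, children)
# tuples, where a child is either another tuple or a bare word string.
def parse_tree(s):
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("      # every node starts with "("
        pos += 1
        label = tokens[pos]; pos += 1  # node label (S, NP, NNP, ...)
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())        # nested constituent
            else:
                children.append(tokens[pos])   # terminal word
                pos += 1
        pos += 1                       # consume ")"
        return (label, children)

    return read()

tree = parse_tree("(S (NP (NNP Pierre) (NNP Vinken)) (VP (MD will) (VB join)))")
print(tree[0])     # S
print(tree[1][0])  # ('NP', [('NNP', ['Pierre']), ('NNP', ['Vinken'])])
```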

Page 29:

× 49,208

Page 30:

Grammar (one definition)
• A productive mechanism that tells a story about how a Treebank was produced
• How was this tree produced?

28

Page 31:

• One story:
  – S → NP , NP VP .
  – NP → NNP NNP
  – , → ,
  – NP → *
  – VP → VB NP
  – NP → PRP$ NN
  – . → .
• This is a top-down, generative story

29

Page 32:

Context Free Grammar
• Nonterminals are rewritten based on the lefthand side alone

30

[Chomsky formal language hierarchy: Turing machine ⊃ context-sensitive grammar ⊃ context-free grammar ⊃ finite state machine]


Page 39:

Context Free Grammar
• Nonterminals are rewritten based on the lefthand side alone
• Algorithm:
  – Start with TOP
  – For each leaf nonterminal:
    ∎ Sample a rule from the set of rules for that nonterminal
    ∎ Replace it with the rule’s righthand side
    ∎ Recurse
• Terminates when there are no more nonterminals

30

[Chomsky formal language hierarchy: Turing machine ⊃ context-sensitive grammar ⊃ context-free grammar ⊃ finite state machine]
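The sampling algorithm above is a few lines of code. A sketch over a made-up toy grammar (rule probabilities are omitted, so `random.choice` samples rules uniformly):

```python
import random

# Toy CFG: each nonterminal maps to a list of possible righthand sides.
# Any symbol without an entry here is a terminal.
GRAMMAR = {
    "TOP": [["S"]],
    "S":   [["NP", "VP"]],
    "NP":  [["the", "dog"], ["the", "cat"]],
    "VP":  [["barks"], ["sees", "NP"]],
}

def generate(symbol="TOP"):
    """Top-down generation: expand each leaf nonterminal recursively."""
    if symbol not in GRAMMAR:                 # terminal: emit it
        return [symbol]
    rhs = random.choice(GRAMMAR[symbol])      # sample a rule for this nonterminal
    words = []
    for sym in rhs:                           # replace with the righthand side
        words.extend(generate(sym))           # ...and recurse
    return words

random.seed(0)
print(" ".join(generate()))
```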


Page 48:

TOP                                       [TOP → S]
S                                         [S → VP]
VP                                        [VP → (VB→halt) NP PP]
halt NP PP                                [NP → (DT→The) (JJ→market-jarring) (CD→25)]
halt The market-jarring 25 PP             [PP → (IN→at) NP]
halt The market-jarring 25 at NP          [NP → (DT→the) (NN→bond)]
halt The market-jarring 25 at the bond

(TOP (S (VP (VB halt) (NP (DT The) (JJ market-jarring) (CD 25)) (PP (IN at) (NP (DT the) (NN bond))))))

31

Page 49:

Where do grammars come from?

32

Page 50:

• The Treebank!
  – Depending on the formalism, it can be read from annotated treebanks
  – Might require additional information
    ∎ e.g., head rules for a dependency grammar conversion
  – This defines a model of how the Treebank was produced

33
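Reading a phrase-structure grammar off a treebank is mechanical: every internal node of every tree contributes one rule. A sketch over a single hand-built tree, using nested `(label, children)` tuples where string children are words:

```python
from collections import Counter

# One tiny "treebank" tree for the fragment "Pierre Vinken will join".
tree = ("S",
        [("NP", [("NNP", ["Pierre"]), ("NNP", ["Vinken"])]),
         ("VP", [("MD", ["will"]), ("VB", ["join"])])])

def count_rules(node, counts):
    """Read one rule off this node, then recurse into tuple children."""
    label, children = node
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if isinstance(c, tuple):
            count_rules(c, counts)
    return counts

counts = count_rules(tree, Counter())
print(counts[("S", ("NP", "VP"))])    # 1
print(counts[("NP", ("NNP", "NNP"))]) # 1
```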

Page 51:

Probabilities
• A useful addition:
  – S → NP , NP VP . [0.002]
  – NP → NNP NNP [0.037]
  – , → , [0.999]
  – NP → * [X]
  – VP → VB NP [0.057]
  – NP → PRP$ NN [0.008]
  – . → . [0.987]
• This is a probabilistic, top-down, generative story
• Can also be taken from Treebanks, by relative frequency:

  P(X → α) = count(X → α) / Σα′ count(X → α′)

34
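Estimating those probabilities from a treebank is relative-frequency counting: divide each rule's count by the total count of its lefthand side. A sketch with made-up counts:

```python
from collections import Counter

# Rule counts as they might be read off a treebank (made-up numbers).
counts = Counter({
    ("NP", ("NNP", "NNP")): 37,
    ("NP", ("DT", "NN")):   55,
    ("NP", ("PRP$", "NN")):  8,
})

# Total count of each lefthand side.
lhs_totals = Counter()
for (lhs, rhs), c in counts.items():
    lhs_totals[lhs] += c

# Relative frequency: P(X -> rhs) = count(X -> rhs) / count(X).
probs = {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}
print(probs[("NP", ("DT", "NN"))])  # 0.55
```

By construction the probabilities for each lefthand side sum to one, which is exactly what the generative story requires.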

Page 52: SYNTAX - GitHub Pages · 2019-12-05 · LINGUISTIC FUNDAMENTALS FOR NATURAL LANGUAGE PROCESSING MORGAN & CLAYPOOL Linguistic Fundamentals for Natural Language Processing 100 Essentials

Other tasks and models
• Grammar induction: humans learn grammar without a Treebank; can computers?
• Lexicalized models: build richer models that account for head-driven structure generation
• Dependency conversions: define a generative dependency process with labeled arcs
• More descriptive grammars: (mildly) context-sensitive grammars, attribute-value structures
• And many more

35

Page 53:

Today we will cover

36

what is a grammar and where do they come from?

A grammar is an explicit set of rules that explain how a Treebank might have been generated

Grammars come from linguists, either indirectly (via a formalism applied to a Treebank) or directly

Page 54:

Today we will cover

37

what is syntax?

what is a grammar and where do they come from?

how can a computer find a sentence’s structure?

what can parse trees be used for?

Linguistics Computer Science

Page 55:

Parsing
• How do we transform
Fred Jones was worn out from caring for his often screaming and crying wife during the day but he couldn’t sleep at night for fear that she in a stupor from the drugs that didn’t ease the pain would set the house ablaze with a cigarette.

38

Page 56:
Page 57:
Page 58:

• One story:
  – S → NP , NP VP .
  – NP → NNP NNP
  – , → ,
  – NP → *
  – VP → VB NP
  – NP → PRP$ NN
  – . → .

• This is a top-down, generative story

40

Page 59:

Chart Parsing
• Cocke-Younger-Kasami (CYK / CKY) algorithm

41

0 Time 1 flies 2 like 3 an 4 arrow 5

• Chart entries, in the order they are built:
  – POS cells: NN | NN,VB | VB,IN | DT | NN
  – NP → NN (Time); NP → DT NN (an arrow)
  – NP → NN NN (Time flies); PP → 2IN3 3NP5 (like an arrow)
  – VP → 2VB3 3NP5 (like an arrow, with "like" as a verb)
  – VP → VB PP (flies like an arrow)
  – S → 0NP1 1VP5 ("Time | flies like an arrow")
  – S → 0NP2 2VP5 ("Time flies | like an arrow")
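The chart above can be filled mechanically. A minimal CKY recognizer sketch; the sentence, POS assignments, and rules are transcribed from the chart (the unary NP → NN rule is applied wherever an NN appears, which adds a few harmless extra cells):

```python
from collections import defaultdict

words = ["Time", "flies", "like", "an", "arrow"]
# Possible POS tags per word, as in the bottom row of the chart.
pos = [{"NN"}, {"NN", "VB"}, {"VB", "IN"}, {"DT"}, {"NN"}]

# Binary rules from the slide, indexed by their right-hand side.
binary = {
    ("DT", "NN"): {"NP"}, ("NN", "NN"): {"NP"},
    ("IN", "NP"): {"PP"}, ("VB", "NP"): {"VP"},
    ("VB", "PP"): {"VP"}, ("NP", "VP"): {"S"},
}
unary = {"NN": {"NP"}}  # NP -> NN

n = len(words)
chart = defaultdict(set)  # chart[i, j] holds labels spanning words[i:j]

for i in range(n):  # width-1 spans: POS tags plus unary closure
    chart[i, i + 1] |= pos[i]
    for tag in list(chart[i, i + 1]):
        chart[i, i + 1] |= unary.get(tag, set())

for width in range(2, n + 1):      # wider spans, smallest first
    for i in range(n - width + 1):
        j = i + width
        for k in range(i + 1, j):  # every split point
            for left in chart[i, k]:
                for right in chart[k, j]:
                    chart[i, j] |= binary.get((left, right), set())

print(sorted(chart[0, n]))  # ['S']: both S analyses cover the whole span
```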

Page 67:

Complexity analysis
• What is the running time of CKY
  – as a function of input sentence length?
  – as a function of the number of rules in the grammar?

42
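For reference, a sketch of the standard answer (not spelled out on the slide): the chart has O(n²) spans, each span tries O(n) split points, and each split point checks the grammar's binary rules.

```latex
% CKY running time, with n = sentence length and |G| = number of rules:
%   spans: O(n^2), split points per span: O(n), rule checks per split: O(|G|)
T(n) = O(n^2) \cdot O(n) \cdot O(|G|) = O(n^3 \, |G|)
```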

Page 68:

Today we will cover

43

how can a computer find a sentence’s structure?

For context-free grammars, the (weighted) CKY algorithm can be used to find the most probable (maximum a posteriori) tree under a given grammar

Page 69:

Today we will cover

44

what is syntax?

what is a grammar and where do they come from?

how can a computer find a sentence’s structure?

what can parse trees be used for?

Linguistics Computer Science

Page 70:

Uses of trees
• Semantic role labeling (Weds)
• Machine translation
• Today: measuring syntactic diversity

45

Page 71:

Syntactic diversity
• How many ways are there to rephrase a sentence while retaining its meaning?

Fred Jones was worn out from caring for his often screaming and crying wife

46

Page 72:

• Suppose we had a paraphrase system that could rewrite this sentence:
  – Fred Jones was tired from caring for his often screaming and crying wife
  – Fred Jones was worn out from caring for his frequently screaming and crying wife
  – Fred Jones was worn out from caring for his often screaming and crying spouse
• To help train this system, we’d like a diversity metric
  – Fred Jones’ wife’s frequent yelling and crying brought him to the brink of exhaustion.

47

Page 73:

Tree kernels
• A way to compare how similar two trees are by counting how many tree fragments they have in common

48

Example fragments:

(S NP (VP (VBD was) VP))
(S (VP (VBG caring) (PP (IN for) NP)))
(NP PRP$ ADJP CC NN (NN wife))

Page 77:

• how many fragments?
• how many fragments in common?

49

Page 78:

Algorithm
• Node score:
  – Δ(n1, n2) = 0 if rule(n1) ≠ rule(n2)
  – Δ(n1, n2) = 1 if the rules are the same and are terminal rules
  – Δ(n1, n2) = ∏_{j=1}^{|n1|} (1 + Δ(n1j, n2j)) otherwise
• Kernel score:
  K(T1, T2) = ∑_{n1∈T1} ∑_{n2∈T2} Δ(n1, n2)

50
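The recursion above maps almost line-for-line onto code. A sketch over nested-tuple trees; the example pair is invented (two NP → DT JJ NN trees sharing everything but the adjective), not taken from the slides:

```python
def rule(t):
    # A node's rule: its label plus the sequence of its children's labels.
    return (t[0], tuple(c if isinstance(c, str) else c[0] for c in t[1:]))

def preterminal(t):
    # A terminal rule: all children are bare word strings.
    return all(isinstance(c, str) for c in t[1:])

def delta(n1, n2):
    """Number of common fragments rooted at n1 and n2 (Collins & Duffy)."""
    if rule(n1) != rule(n2):
        return 0
    if preterminal(n1):
        return 1
    prod = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        prod *= 1 + delta(c1, c2)
    return prod

def nodes(t):
    yield t
    for c in t[1:]:
        if not isinstance(c, str):
            yield from nodes(c)

def kernel(t1, t2):
    return sum(delta(a, b) for a in nodes(t1) for b in nodes(t2))

t1 = ("NP", ("DT", "the"), ("JJ", "red"), ("NN", "car"))
t2 = ("NP", ("DT", "the"), ("JJ", "blue"), ("NN", "car"))
print(kernel(t1, t2))  # 6: NP pair gives (1+1)(1+0)(1+1)=4, plus DT and NN pairs
```

Note this is the naive O(|T1|·|T2|) double loop; the Moschitti paper cited later on the slides is about making this computation fast in practice.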

Page 79:

• how many fragments in common?

51

K(T1, T2), for two NP → DT JJ NN trees that differ only in their adjective (tree images omitted):

= Δ(NP1, NP2) + Δ(DT1, DT2) + Δ(JJ1, JJ2) + Δ(NN1, NN2) + Δ(NP1, DT2) + Δ(NP1, JJ2) + Δ(NP1, NN2) + . . .
= Δ(NP1, NP2) + 1 + 0 + 1
= (1 + Δ(DT1, DT2)) ⋅ (1 + Δ(JJ1, JJ2)) ⋅ (1 + Δ(NN1, NN2)) + 2
= (1 + 1) ⋅ (1 + 0) ⋅ (1 + 1) + 2
= 6

Page 86:

Details

52

Making Tree Kernels Practical for Natural Language Learning
Alessandro Moschitti, EACL 2006
https://www.aclweb.org/anthology/E06-1015/
[first page of the paper shown on the slide]

Page 87:

Today we will cover

53

what can parse trees be used for?

Parse trees are useful in a wide range of tasks

One application, tree kernels, can be used to compare how similar two trees are by looking at all possible fragments between them


Page 92:

Summary

58

syntax is the study of the structure of language

grammars usually provide generative stories and can be learned from Treebanks

trees can be produced by parsing a sentence with a grammar

trees are useful in many applications, including testing syntactic diversity

Linguistics Computer Science