LightSIDE

Carolyn Penstein Rosé, Language Technologies Institute, Human-Computer Interaction Institute, School of Computer Science

With funding from the National Science Foundation and the Office of Naval Research


Page 1:

Carolyn Penstein Rosé
Language Technologies Institute
Human-Computer Interaction Institute
School of Computer Science

With funding from the National Science Foundation and the Office of Naval Research

LightSIDE

Page 2:

lightsidelabs.com/research/

Page 3:

Page 4:

Click here to load a file

Page 5:

Select Heteroglossia as the predicted category

Page 6:

Make sure the text field is selected as the field to extract text features from

Page 7:

Feature Space Customizations

Punctuation can be a “stand-in” for mood: “you think the answer is 9?” versus “you think the answer is 9.”

Bigrams capture simple lexical patterns: “common denominator” versus “common multiple”

Trigrams are just like bigrams, but with 3 words next to each other: “Carnegie Mellon University”

POS bigrams capture syntactic or stylistic information: “the answer which is …” versus “which is the answer”

Line length can be a proxy for explanation depth
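A minimal sketch of how unigram, bigram, trigram, punctuation, and line-length features might be extracted. This is not LightSIDE's own code: the function names, the tokenizer regex, and the `LINE_LENGTH` feature name are assumptions made for illustration. POS bigrams are left out because they need a part-of-speech tagger.

```python
import re

def tokenize(text, keep_punctuation=True):
    """Lowercase the text and split it into word tokens.

    When keep_punctuation is True, each punctuation mark becomes its own
    token, so "9?" and "9." yield different features.
    """
    if keep_punctuation:
        pattern = r"[A-Za-z0-9']+|[^\sA-Za-z0-9']"
    else:
        pattern = r"[A-Za-z0-9']+"
    return re.findall(pattern, text.lower())

def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text):
    """Build a binary feature dictionary plus a numeric line-length feature."""
    tokens = tokenize(text)
    names = set(tokens)              # unigrams (including punctuation)
    names.update(ngrams(tokens, 2))  # bigrams
    names.update(ngrams(tokens, 3))  # trigrams
    features = {name: 1 for name in names}
    # Upper-case name cannot collide with the lowercased token features.
    features["LINE_LENGTH"] = len(tokens)
    return features
```

With this sketch, the two versions of “you think the answer is 9” differ only in their punctuation features, which is what lets a learner separate the question from the statement.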

Page 8:

Feature Space Customizations

“Contains non-stop word” can be a predictor of whether a conversational contribution is contentful: “ok sure” versus “the common denominator”

Removing stop words removes some distracting features

Stemming allows some generalization: multiple, multiply, multiplication

Removing rare features is a cheap form of feature selection: features that occur only once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
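The stop-word and rare-feature customizations can be sketched as follows. The stop-word list here is a tiny illustrative assumption, not the list LightSIDE ships with, and `min_count=3` is an assumed threshold chosen to match “once or twice”. Stemming is omitted because it needs a stemmer implementation.

```python
from collections import Counter

# Illustrative stop-word list only (an assumption, not LightSIDE's list).
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "ok", "sure"}

def contains_non_stop_word(tokens):
    """True if at least one token is not a stop word."""
    return any(token not in STOP_WORDS for token in tokens)

def remove_stop_words(tokens):
    """Drop stop words, keeping the remaining tokens in order."""
    return [token for token in tokens if token not in STOP_WORDS]

def prune_rare_features(documents, min_count=3):
    """Keep only features that occur in at least min_count documents.

    documents is a list of token lists. Returns the pruned documents and
    the set of surviving features.
    """
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))
    keep = {f for f, count in document_frequency.items() if count >= min_count}
    pruned = [[t for t in tokens if t in keep] for tokens in documents]
    return pruned, keep
```

Counting documents rather than raw occurrences is a design choice in this sketch: a word repeated many times in a single contribution still tells the learner nothing about other contributions.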

Page 9:

Feature Space Customizations

Think like a computer! Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful

Look for approximations: if you want to find questions, you don’t need to do a complete syntactic analysis

Look for question marks

Look for wh-terms that occur immediately before an auxiliary verb
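The question approximation above can be sketched with a regular expression. The word lists are short illustrative assumptions, not an exhaustive inventory of wh-terms or auxiliary verbs.

```python
import re

# Illustrative, non-exhaustive word lists (assumptions for this sketch).
WH_TERMS = ("who", "what", "when", "where", "why", "which", "how")
AUXILIARIES = ("is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "should",
               "has", "have", "had")

# A wh-term followed immediately by an auxiliary verb, e.g. "what is".
WH_BEFORE_AUX = re.compile(
    r"\b(?:%s)\s+(?:%s)\b" % ("|".join(WH_TERMS), "|".join(AUXILIARIES)),
    re.IGNORECASE,
)

def looks_like_question(text):
    """Cheap approximation: a question mark, or a wh-term before an auxiliary."""
    return "?" in text or WH_BEFORE_AUX.search(text) is not None
```

Because this is an approximation rather than a syntactic analysis, it also fires on relative clauses such as “the answer which is nine”. Whether that noise matters depends on how predictive the feature is in the data.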

Page 10:

Click to extract text features

Page 11:

Select Logistic Regression as the Learner

Page 12:

Evaluate the result by cross-validation over sessions
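One way to read “cross-validation over sessions” is that each fold holds out every instance from one session, so contributions from the same session never appear in both the training and the test data. The sketch below follows that reading. It is an assumption about the fold scheme, and `session_folds` is a hypothetical helper, not a LightSIDE function.

```python
from collections import defaultdict

def session_folds(session_ids):
    """Yield (held_out_session, train_indices, test_indices), one fold per session.

    session_ids[i] is the session label of instance i.
    """
    by_session = defaultdict(list)
    for index, session in enumerate(session_ids):
        by_session[session].append(index)
    for held_out in sorted(by_session):
        test = by_session[held_out]
        train = sorted(
            index
            for session, indices in by_session.items()
            if session != held_out
            for index in indices
        )
        yield held_out, train, test
```

For example, with session labels `["s1", "s1", "s2", "s3", "s3", "s2"]` the first fold tests on instances 0 and 1 and trains on the remaining four.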

Page 13:

Run the experiment

Page 14:

Page 15:

Stretchy Patterns (Gianfortoni, Adamson, & Rosé, 2011)

A sequence of 1 to 6 categories

May include GAPs

A GAP can cover any symbol

GAP+ may cover any number of symbols

Must not begin or end with a GAP
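A minimal sketch of matching a stretchy pattern against a sequence of symbols, under several stated assumptions: symbols and categories are plain strings, GAP covers exactly one symbol, GAP+ covers one or more symbols, and the 1-to-6 length limit counts gaps as well as categories. The slide does not settle these details, so treat this as one possible reading rather than the published algorithm.

```python
GAP = "GAP"        # assumed to cover exactly one symbol
GAP_PLUS = "GAP+"  # assumed to cover one or more symbols
GAPS = (GAP, GAP_PLUS)

def is_valid_pattern(pattern):
    """1 to 6 elements (assumed to include gaps), no gap at either end."""
    if not 1 <= len(pattern) <= 6:
        return False
    return pattern[0] not in GAPS and pattern[-1] not in GAPS

def _match(pattern, symbols, p, s):
    """True if pattern[p:] matches symbols starting exactly at position s."""
    if p == len(pattern):
        return True
    if s >= len(symbols):
        return False
    item = pattern[p]
    if item == GAP:
        return _match(pattern, symbols, p + 1, s + 1)
    if item == GAP_PLUS:
        # Try every possible stretch of one or more symbols.
        return any(_match(pattern, symbols, p + 1, nxt)
                   for nxt in range(s + 1, len(symbols) + 1))
    return symbols[s] == item and _match(pattern, symbols, p + 1, s + 1)

def contains_pattern(pattern, symbols):
    """True if the pattern matches somewhere in the symbol sequence."""
    if not is_valid_pattern(pattern):
        raise ValueError("invalid stretchy pattern: %r" % (pattern,))
    return any(_match(pattern, symbols, 0, start)
               for start in range(len(symbols)))
```

For the sequence `["would", "have", "been", "more", "interesting"]`, the pattern `["would", "GAP+", "interesting"]` matches because GAP+ stretches over three symbols, while `["would", "GAP", "interesting"]` does not.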

Page 16:

Page 17:

Page 18:

Now it’s your turn! We’ll explore some advanced features and error analysis after the break!

Page 19:

Error Analysis Process

Identify large error cells

Make comparisons

Ask yourself how a misclassified instance is similar to the instances that were correctly classified with the same class (vertical comparison)

Ask how it is different from instances of the class it should have been assigned to but was not (horizontal comparison)

Positive, Negative
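The first step, identifying large error cells, can be sketched by counting (actual, predicted) pairs and ranking the off-diagonal cells. The function names and label strings are assumptions for illustration and are independent of LightSIDE's interface.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count instances per (actual label, predicted label) cell."""
    return Counter(zip(actual, predicted))

def largest_error_cells(actual, predicted, top=1):
    """Return the top off-diagonal cells as (actual, predicted, count)."""
    cells = confusion_matrix(actual, predicted)
    errors = [(count, a, p) for (a, p), count in cells.items() if a != p]
    # Largest count first; ties broken alphabetically for a stable order.
    errors.sort(key=lambda e: (-e[0], e[1], e[2]))
    return [(a, p, count) for count, a, p in errors[:top]]

def instances_in_cell(actual, predicted, actual_label, predicted_label):
    """Indices of the instances that fall into one confusion-matrix cell."""
    return [i for i, (a, p) in enumerate(zip(actual, predicted))
            if a == actual_label and p == predicted_label]
```

Once the largest error cell is known, `instances_in_cell` can pull out both the misclassified instances and the correctly classified instances of either class, which are the two groups the comparisons above call for.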

Page 20:

Error Analysis on Development Set

Page 21:

Error Analysis on Development Set

Page 22:

Error Analysis on Development Set

Page 23:

Error Analysis on Development Set

Page 24:

Error Analysis on Development Set

Page 25:

Page 26:

Page 27:

Positive: is interesting, an interesting scene

Negative: would have been more interesting, potentially interesting, etc.

What’s different?

Page 28:

Page 29:

Page 30:

Page 31:

Page 32:

Page 33:

* Note that in this case we get no benefit if we use feature selection over the original feature space.

Page 34:

Feature Splitting (Daumé III, 2007)

Figure labels: General, Domain A, Domain B

Why is this nonlinear?

It represents the interaction between each feature and the Domain variable

Now that the feature space represents the nonlinearity, the algorithm to train the weights can be linear.
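A minimal sketch of feature splitting: every feature is copied into a general version and a version tagged with the instance's domain, so a linear learner can assign a shared weight plus a per-domain adjustment. The `general:` and domain prefixes, the function names, and the weights in the example are all illustrative assumptions.

```python
def split_features(features, domain):
    """Copy each feature into a general version and a domain-specific version."""
    augmented = {}
    for name, value in features.items():
        augmented["general:" + name] = value
        augmented[domain + ":" + name] = value
    return augmented

def linear_score(weights, features):
    """Plain dot product between a weight dictionary and a feature dictionary."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())
```

With hypothetical weights `{"general:interesting": 0.5, "A:interesting": 1.0, "B:interesting": -1.5}`, the same word scores 1.5 in Domain A and -1.0 in Domain B. The model is still linear in the augmented space, yet the effect of the word depends on the domain.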

Page 35:

Healthcare Bill Dataset

Page 36:

Healthcare Bill Dataset

Page 37:

Healthcare Bill Dataset

Page 38:

Healthcare Bill Dataset

Page 39:

Healthcare Bill Dataset

Page 40:

Healthcare Bill Dataset

Page 41:

Healthcare Bill Dataset

Page 42:

Healthcare Bill Dataset

Page 43:

Healthcare Bill Dataset