A Data-Oriented Model of Literary Language

Andreas van Cranenburgh (Institut für Sprache und Information, Heinrich Heine University Düsseldorf)
Rens Bod (Institute for Logic, Language and Computation, University of Amsterdam)
EACL, Valencia, April 7, 2017
This talk
Characterizing Literary Language:
- What makes a literary novel literary?
- Can a model predict this?
Specifically . . .
Research Question: are there particular textual conventions in literary novels that contribute to readers judging them to be literary?
Background
Definition: Literature is the body of work with the most artistic or imaginative fine writing (Britannica, 1911).
- Demarcation problem
- Some argue the text is irrelevant; only context/prestige matters
- Therefore, interesting to quantify the influence of the text
- NB: not the same as success, popularity, quality, &c.
The Riddle of Literary Quality
Corpus:
- 401 recent Dutch novels (translated & original)
- Published 2007–2012
- Selected by popularity

Contrast: Gutenberg, Google Books
- more books (thousands, millions)
- not representative (volunteer work, digital availability)
- not contemporary (19th century)
cf. Pechenick et al. (2015), PLoS ONE. Characterizing the Google Books Corpus: Strong Limits [. . . ]
http://www.literaryquality.huygens.knaw.nl
Survey ratings: 401 novels; N=14k
[Figure: mean literary rating per novel with 95% confidence intervals, sorted from 'Definitely not literary' (1) to 'Highly literary' (7). Labelled novels, from lowest to highest rated: James, 50 shades; Child, 61 hours; Smith, Those in peril; Austin, Until we reach..; Adler-Olsen, Conspiracy of..; Watson, Before I go..; Scholten, Kameraad baron; Beijnum, Soort familie; Barnes, Sense of ending.]

Constraints:
- ≥ 50 ratings
- ≥ 2000 sentences

369 novels remain; 91% of the novels have a confidence interval < 0.5.
http://www.hetnationalelezersonderzoek.nl
Overview
[Figure: document-feature matrix; rows are the 369 novels (e.g. '50 shades', 'eat pray love', 'super highbrow stuff'), columns are features such as sentence length, bag of words, and genre, paired with the survey rating (e.g. 2.1, 4.7, 6.6); task: predict the rating.]
Experimental setup
Task: predict the mean literary rating (1–7)
Training data: 1000 sentences per novel
Evaluation metric: R2 (≈ % of variation explained; baseline = 0.0, perfect = 100%)
Show incremental improvement with each type of feature.
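For reference, a minimal illustration (not from the talk) of how the R2 metric behaves, using scikit-learn and toy rating values: a constant baseline prediction scores 0.0, while near-perfect predictions approach 1.0 (100%).

    from sklearn.metrics import r2_score

    # Toy values for illustration only; ratings are on the 1-7 survey scale.
    y_true = [2.1, 4.7, 6.6, 3.2]
    y_baseline = [sum(y_true) / len(y_true)] * len(y_true)  # always predict the mean
    y_good = [2.3, 4.5, 6.4, 3.3]

    print(r2_score(y_true, y_baseline))  # 0.0: the baseline explains no variation
    print(r2_score(y_true, y_good))      # ~0.99, i.e. close to 100 %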
Simple Stylistic Measures
                                          R2
Mean sent. len.                           16.4
+ % Direct speech                         23.1
+ % Basic vocab. (top 3000 words)         23.5
+ Compression ratio (bzip2)               24.4
+ Cliche expressions                      30.0

Table: Basic features, incremental scores.
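A minimal sketch of how such basic features could be computed (an illustration, not the authors' implementation; it assumes tokenized sentences and externally supplied top-3000 word and cliche lists, and uses a crude quote-mark heuristic for direct speech):

    import bz2

    def basic_features(sentences, top3000, cliches):
        """Simple stylistic measures for one novel.

        sentences: list of token lists; top3000: set of frequent words;
        cliches: list of cliche expressions (strings)."""
        tokens = [t for sent in sentences for t in sent]
        text = "\n".join(" ".join(sent) for sent in sentences)
        mean_sent_len = len(tokens) / len(sentences)
        # crude proxy: a sentence counts as direct speech if it contains a quote mark
        direct_speech = sum(
            any(t in {"'", '"', "\u2018", "\u201c"} for t in sent)
            for sent in sentences) / len(sentences)
        basic_vocab = sum(t.lower() in top3000 for t in tokens) / len(tokens)
        raw = text.encode("utf8")
        compression_ratio = len(bz2.compress(raw)) / len(raw)  # repetitive text compresses better
        cliche_count = sum(text.count(c) for c in cliches)
        return {"mean sent. len.": mean_sent_len,
                "% direct speech": 100 * direct_speech,
                "% basic vocab.": 100 * basic_vocab,
                "compression ratio": compression_ratio,
                "cliche expressions": cliche_count}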
Strong lexical baselines
Setup: Linear Support Vector Regression, 5-fold cross-validation (sketched below)

                              R2
Basic features                30.0
+ LDA: 50 topic weights       52.2
+ Word bigrams                59.5
+ Char. 4-grams               59.9

On average,
- 59.9% of the variation in ratings (R2) is explained using basic and lexical features.
- the prediction is off by 0.64 (RMSE) on the 1–7 rating scale.
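A hedged sketch of this setup with scikit-learn (placeholder random data stands in for the real document-feature matrix; preprocessing and hyperparameters are assumptions, not taken from the talk):

    import numpy as np
    from sklearn.svm import LinearSVR
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    # In the real setup, X holds the basic features, LDA topic weights,
    # word bigram and character 4-gram counts for each of the 369 novels,
    # and y the mean literary rating (1-7) per novel.
    rng = np.random.default_rng(0)
    X = rng.random((369, 500))          # placeholder feature matrix
    y = rng.uniform(1, 7, size=369)     # placeholder ratings

    model = make_pipeline(StandardScaler(), LinearSVR())
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print("mean R2 over 5 folds: %.1f%%" % (100 * scores.mean()))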
n-gram limitations
1. fixed n: no MWEs, no long-distance relations
2. no linguistic abstraction: e.g., syntactic categories, grammatical functions
3. small features: harder to interpret

- Larger features ⇒ combinatorial explosion
- Use data-driven feature selection
Recurring Tree Fragments
- Syntactic tree fragments of arbitrary size (connected subsets of tree productions)
- Extract automatically from training data: find overlapping parts of parse trees
- Apply cross-validation (folds 1–5)
- Feature selection using correlation with literary rating (see the sketch below)
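The correlation-based selection step can be sketched as follows (a simplified illustration; the fragment extraction itself, i.e. finding the common fragments of pairs of parse trees, is not shown, and the cutoff k is an assumption):

    import numpy as np
    from scipy.stats import pearsonr

    def select_fragments(counts, ratings, k=1000):
        """Keep the k fragments whose per-novel counts correlate most strongly
        (positively or negatively) with the mean literary ratings.

        counts: array of shape (n_novels, n_fragments); ratings: (n_novels,)."""
        corr = np.array([pearsonr(counts[:, j], ratings)[0]
                         for j in range(counts.shape[1])])
        order = np.argsort(-np.abs(corr))[:k]   # indices of the strongest correlations
        return order, corr[order]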
Example fragments
[Figure: three example tree fragments, with node labels such as NP-obj1, N-hd, LET, ROOT, SMAIN, NP-su, PP-mod, VZ-hd, LID-det 'een'; each is shown with a scatter plot of its per-novel count against the mean literary rating, with correlations r = 0.526, r = -0.417, and r = 0.4.]
Results w/Fragments
                              R2
Basic features                30.0
+ LDA: 50 topic weights       52.2
+ Word bigrams                59.5
+ Char. 4-grams               59.9
+ Syntactic fragments         62.2

- Syntax gives a modest performance improvement
- However, the features are linguistically more interesting
Analysis of tree fragments
Fragments positively correlated with literary ratings:
- Many small fragments
- Indicators of more complex syntax, e.g.:
  appositive NPs: "His name was Adrian Finn, a tall, shy boy who [. . . ]" (Barnes, Sense of an ending)
  complex, nested NPs/PPs: "[. . . ] a whole storetank of existential rage" (Barnes, Sense of an ending)
  discontinuous constituents: "'Miss Aibagawa,' declared Ogawa, 'is a midwife.'" (Mitchell, Thousand autumns of J. Zoet)
Metadata
Coarse genre: Fiction, Suspense, Romance, Other
Translated vs. originally Dutch
Author gender: male, female, mixed/unknown

                              R2
Basic features                30.0
+ Auto. induced feat.         61.2
+ Genre                       74.3
+ Translated                  74.0
+ Author gender               76.0

Table: Metadata features; incremental scores.
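One plausible way to feed these metadata categories to the regression model is one-hot encoding appended to the textual features; the talk does not specify the exact encoding, so this is an assumption:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # One row of metadata per novel: (coarse genre, translated or original, author gender).
    meta = np.array([
        ["Suspense", "translated", "male"],
        ["Fiction", "original", "female"],
        ["Romance", "translated", "unknown"],
    ], dtype=object)

    onehot = OneHotEncoder().fit_transform(meta).toarray()
    # X_full = np.hstack([X_text, onehot])   # append to the textual feature matrix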
Prediction scatter plot
[Figure: scatter plot of predicted vs. actual reader judgments (1–7), with novels marked by category (Fiction, Suspense, Romantic, Other); labelled points include James, Fifty shades of Grey; Kinsella, Remember me?; Smeets, Afrekening; Gilbert, Eat pray love; Koch, The dinner; Stockett, The help; Donoghue, Room; Franzen, Freedom; Barnes, Sense of an ending; Lewinsky, Johannistag; Mortier, Godenslaap; Rosenboom, Zoete mond; Baldacci, Hell's corner; French, Blue monday.]
Conclusion
Research Question: are there particular textual conventions in literary novels that contribute to readers judging them to be literary?

- Yes! Literary conventions are non-arbitrary because they are associated with textual features
- Literariness can be predicted from text to a large extent: text-intrinsic literariness
- Cumulative improvements with an ensemble of features
- Robust result: both coarse & fine rating differences are predicted
- Literature is characterized by a larger inventory of lexico-syntactic constructions
THE END
Dissertation & code: http://andreasvc.github.io
Figure: Huff (1954). How to lie with statistics.
Fragment size (non-terminals)
[Figure: histogram of the number of fragments by fragment size (1–21 non-terminals), separated into positively and negatively correlated fragments.]
Syntactic category of root node
[Figure: number of positively and negatively correlated fragments by syntactic category of the root node (NP, PP, SMAIN, SSUB, CONJ, n, DU, ROOT, ww, INF, adj, PPART, CP, bw, vnw).]
Function tag of root node
[Figure: number of positively and negatively correlated fragments by function tag of the root node (mod, (none), obj1, hd, body, dp, su, vc, nucl, pc, ld, predc, sat, app, det).]
Top fragments by correlation with the literary rating:
1. n-hd ,                          r = 0.52
2. NP-su SMAIN-dp , SMAIN-dp       r = 0.46
3. lid-det n-hd                    r = 0.42
4. lid-det NP-app                  r = 0.41
5. SMAIN-dp DU .                   r = 0.41
6. vz-hd CONJ-obj1 NP-obj1         r = 0.41
7. ww-hd NP-su                     r = 0.41
8. lid-det n-hd                    r = 0.41
9. [SMAIN-dp . . . , . . . ]       r = 0.41
10. In                             r = 0.41