Elements of Style: Identifying Unstructured Text Data Jamie Evers • 3/28/16

Status report (7)

-Style: how something is written rather than what is being written about

Applications:

-Plagiarism detection
-Use as a tool for educators
-Authorship attribution

Data and Tools:

“By his age he should have belonged to the younger men, but by his wealth and connections he belonged to the groups of old and honored guests, and so he went from one group to another. Some of the most important old men were the center of groups which even strangers approached respectfully to hear the voices of well-known men.”

Baseline Accuracy with Bag of Words:

- 75% accuracy on binary-class problems
- Accuracy declines dramatically with the introduction of additional authors
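As a rough illustration of what a bag-of-words representation looks like, the sketch below turns documents into word-count vectors over a shared vocabulary. The tokenizer, the helper names (`tokenize`, `bag_of_words`), and the two toy documents are assumptions made for this example; the slides do not say which tokenizer or classifier produced the baseline figure.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into word tokens (simple assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))           # shared, ordered vocabulary
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors

# Toy documents, invented for illustration
docs = [
    "he went from one group to another",
    "the old men were the center of groups",
]
vocab, vectors = bag_of_words(docs)
```

Each vector records only which words occur and how often, so it captures topic words (names, settings) as readily as style, which is one plausible reason the baseline degrades as authors are added.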

Lexical Feature Engineering

Methods:

● Richness of Diction

● Variation in Sentence Length
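One way the two lexical measures might be computed is sketched here. Richness of diction is approximated by the type-token ratio (distinct words divided by total words) and variation in sentence length by the standard deviation of words per sentence. Both choices, the regex-based sentence splitter, and the function name `lexical_features` are assumptions; the slides do not specify the exact formulas.

```python
import re
from statistics import mean, pstdev

WORD = r"[A-Za-z'-]+"

def lexical_features(text):
    """Compute simple lexical style features for a non-empty text."""
    # Naive sentence split on terminal punctuation (an assumption)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(WORD, s)) for s in sentences]
    words = re.findall(WORD, text.lower())
    return {
        "type_token_ratio": len(set(words)) / len(words),
        "mean_sentence_length": mean(lengths),
        "sentence_length_stdev": pstdev(lengths),
    }

sample = (
    "By his age he should have belonged to the younger men, but "
    "by his wealth and connections he belonged to the groups of "
    "old and honored guests, and so he went from one group to another. "
    "Some of the most important old men were the center of groups which "
    "even strangers approached respectfully to hear the voices of "
    "well-known men."
)
features = lexical_features(sample)
```

On the sample passage this yields two sentences of 36 and 24 words. The type-token ratio is sensitive to text length, so comparing documents of similar size is advisable.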

Syntactic Feature Engineering

Methods:

● Frequency of each part of speech used

● Distribution of select parts of speech

Preposition, Noun, Adjective, Verb, Symbol

[Preposition, Modal, Noun, Verb, Symbol, Noun]
[Verb, Noun, Modal, Noun, Preposition, Symbol]
[Noun, Modal, Noun, Verb, Preposition, Symbol]
[Noun, Preposition, Symbol, Verb, Modal, Noun]
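The four tag sequences above contain exactly the same tags in different orders, which suggests why frequency alone may not be enough. The sketch below takes the sequences as given (the slides do not name the tagger) and computes relative tag frequencies plus tag bigram counts. Using bigrams to capture how tags are distributed is an assumption made here, not something the slides confirm.

```python
from collections import Counter

# Tag sequences copied from the slide
sequences = [
    ["Preposition", "Modal", "Noun", "Verb", "Symbol", "Noun"],
    ["Verb", "Noun", "Modal", "Noun", "Preposition", "Symbol"],
    ["Noun", "Modal", "Noun", "Verb", "Preposition", "Symbol"],
    ["Noun", "Preposition", "Symbol", "Verb", "Modal", "Noun"],
]

def tag_frequencies(tags):
    """Relative frequency of each part-of-speech tag."""
    counts = Counter(tags)
    return {tag: n / len(tags) for tag, n in counts.items()}

def tag_bigrams(tags):
    """Counts of adjacent tag pairs, a simple order-sensitive feature."""
    return Counter(zip(tags, tags[1:]))

freqs = [tag_frequencies(s) for s in sequences]
bigrams = [tag_bigrams(s) for s in sequences]
```

All four frequency dictionaries come out identical, while the bigram counts differ, so an order-sensitive feature can separate sequences that plain frequencies cannot.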

Stopwords Features

Methods:

● Use common words to identify latent patterns in style

● Avoid trivial indicators such as settings and character names
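A minimal sketch of stopword features follows: each document becomes a vector of relative frequencies for a fixed list of common words, so content words such as names and places never enter the feature space. The ten-word `STOPWORDS` list and the function name are hypothetical; a real list would be considerably longer, and the slides do not say which one was used.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration
STOPWORDS = ["the", "of", "and", "to", "by", "he", "his", "but", "so", "from"]

def stopword_features(text):
    """Relative frequency of each stopword, in STOPWORDS order."""
    words = re.findall(r"[a-z'-]+", text.lower())
    counts = Counter(words)
    total = len(words) or 1          # avoid division by zero on empty text
    return [counts[w] / total for w in STOPWORDS]

sentence = (
    "Some of the most important old men were the center of groups which "
    "even strangers approached respectfully to hear the voices of "
    "well-known men."
)
vec = stopword_features(sentence)
```

Because the vector has one slot per stopword, documents of any length map to the same fixed-size representation.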

Vectorized Word Features

Methods:

● Skip-gram model allows us to incorporate word context in feature space

● Make features from word embeddings by averaging the vectors for each document
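The two ideas can be sketched in plain Python: generating the (target, context) pairs a skip-gram model is trained on, and averaging word vectors into one document vector. The toy two-dimensional embeddings are invented for illustration, and the helper names are hypothetical; in practice the vectors would come from a trained model, which the slides do not identify.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def document_vector(tokens, embeddings, dim):
    """Average the vectors of all in-vocabulary tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim           # no known words: fall back to zeros
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]

# Toy embeddings, invented for illustration only
toy_embeddings = {"old": [1.0, 0.0], "men": [0.0, 1.0], "guests": [1.0, 1.0]}
tokens = ["old", "and", "honored", "guests"]

pairs = skipgram_pairs(tokens, window=1)
doc_vec = document_vector(tokens, toy_embeddings, dim=2)
```

Averaging discards word order but gives every document a vector of the same dimension, which is what a downstream classifier needs.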

Classification: Gaussian Naive Bayes

Lexical Features

Syntax Features

Stopword Features

Word Embeddings

P(Gogol) = .1

P(Dostoevsky) = .3

P(Tolstoy) = .6
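To show how per-author probabilities like those above can arise, here is a small from-scratch Gaussian Naive Bayes sketch. It assumes the feature groups have already been concatenated into one numeric vector per document. The two toy features, their values, and the function names are invented for illustration, and the output is not meant to reproduce the probabilities on the slide.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        n, dim = len(rows), len(rows[0])
        means = [sum(r[k] for r in rows) / n for k in range(dim)]
        # Small constant keeps variances strictly positive (an assumption)
        variances = [sum((r[k] - means[k]) ** 2 for r in rows) / n + 1e-9
                     for k in range(dim)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_proba(model, x):
    """Posterior probability of each class for one feature vector."""
    log_post = {}
    for label, (prior, means, variances) in model.items():
        ll = math.log(prior)
        for xk, m, v in zip(x, means, variances):
            ll += -0.5 * math.log(2 * math.pi * v) - (xk - m) ** 2 / (2 * v)
        log_post[label] = ll
    top = max(log_post.values())     # subtract the max for numerical stability
    unnorm = {lab: math.exp(lp - top) for lab, lp in log_post.items()}
    z = sum(unnorm.values())
    return {lab: p / z for lab, p in unnorm.items()}

# Toy data: [type-token ratio, mean sentence length], invented values
X = [[0.40, 18.0], [0.42, 20.0], [0.55, 30.0],
     [0.57, 33.0], [0.70, 25.0], [0.72, 27.0]]
y = ["Gogol", "Gogol", "Dostoevsky", "Dostoevsky", "Tolstoy", "Tolstoy"]

model = fit_gaussian_nb(X, y)
probs = predict_proba(model, [0.71, 26.0])
best = max(probs, key=probs.get)
```

The model treats every feature as independent and normally distributed within each author, which is a strong simplification but keeps training to a single pass over the data.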

Results:

-Up to 96% accuracy on binary class problems

-Up to 85% accuracy with as many as 6 authors

Thank you!

github.com/jamesevers

http://www.jamievers.co