Elements of Style: Identifying Unstructured Text Data
Jamie Evers • 3/28/16
-How something is written rather than what is being written about
Applications:
- Plagiarism detection
- Use as a tool for educators
- Authorship attribution
Data and Tools:
“By his age he should have belonged to the younger men, but
by his wealth and connections he belonged to the groups of
old and honored guests, and so he went from one group to
another. Some of the most important old men were the center
of groups which even strangers approached respectfully to
hear the voices of well-known men.”
Baseline Accuracy with Bag of Words:
- 75% accuracy for binary class problems
- Accuracy declines dramatically with the introduction of additional authors
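A bag-of-words baseline represents each text by raw word counts and ignores word order. The following is a minimal sketch using only the Python standard library; the tokenizer regex and the names `tokenize` and `bag_of_words` are illustrative assumptions, not the pipeline used in the talk.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))  # shared, ordered vocabulary
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors

# Two made-up snippets, used only to show the shape of the output.
vocab, vectors = bag_of_words(["he went from one group to another",
                               "old men and younger men"])
```

Each row of `vectors` can then be passed to any classifier. Because the counts are dominated by topic words, this kind of representation tends to capture what a text is about rather than how it is written.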
Lexical Feature Engineering
Methods:
● Richness of Diction
● Variation in Sentence Length
“By his age he should have belonged to the younger men, but
by his wealth and connections he belonged to the groups of
old and honored guests, and so he went from one group to
another. Some of the most important old men were the center
of groups which even strangers approached respectfully to
hear the voices of well-known men.”
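The two lexical features above could be computed roughly as follows. This is a sketch under stated assumptions: richness of diction is approximated by the type-token ratio (unique words divided by total words), variation in sentence length by the population standard deviation, and sentences are split naively on `.`, `!`, and `?`. The function name `lexical_features` is hypothetical.

```python
import re
from statistics import mean, pstdev

WORD_RE = r"[A-Za-z'-]+"  # keeps hyphenated words such as "well-known" whole

def lexical_features(text):
    """Compute simple style features that do not depend on topic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if re.search(WORD_RE, s)]
    words = re.findall(WORD_RE, text.lower())
    if not words:
        raise ValueError("text contains no words")
    lengths = [len(re.findall(WORD_RE, s)) for s in sentences]
    return {
        "type_token_ratio": len(set(words)) / len(words),
        "mean_sentence_length": mean(lengths),
        "sentence_length_std": pstdev(lengths),
    }

feats = lexical_features("The cat sat. The dog ran away fast.")
```

The type-token ratio shrinks as texts get longer, so in practice it is usually compared across passages of similar length.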
Syntactic Feature Engineering
Methods:
● Frequency of used parts of speech
● Distribution of select parts of speech
Preposition, Noun, Adjective, Verb, Symbol
[ Preposition, Modal, Noun, Verb, Symbol, Noun ]
[Verb, Noun, Modal, Noun, Preposition, Symbol ]
[ Noun, Modal, Noun, Verb, Preposition, Symbol]
[ Noun, Preposition, Symbol, Verb, Modal, Noun ]
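Once a text has been tagged, part-of-speech frequencies reduce to counting. The sketch below assumes the tags were already produced by some upstream tagger (the slides do not name one) and turns a tag sequence into a fixed-order vector of relative frequencies; `TAGSET`, `pos_distribution`, and `pos_vector` are hypothetical names.

```python
from collections import Counter

def pos_distribution(tags):
    """Map each part-of-speech tag to its relative frequency in one text."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def pos_vector(tags, tagset):
    """Order the frequencies by a fixed tagset so that texts are comparable."""
    dist = pos_distribution(tags)
    return [dist.get(tag, 0.0) for tag in tagset]

TAGSET = ["Noun", "Verb", "Modal", "Preposition", "Symbol"]
# The first example tag sequence shown on the slide.
tags = ["Preposition", "Modal", "Noun", "Verb", "Symbol", "Noun"]
vector = pos_vector(tags, TAGSET)
```

Restricting `TAGSET` to a handful of tags gives the "distribution of select parts of speech" variant.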
Stopwords Features
Methods:
● Use common words to identify latent patterns in style
● Avoids trivial indicators such as settings, character names, etc.
“By his age he should have belonged to the younger men, but
by his wealth and connections he belonged to the groups of
old and honored guests, and so he went from one group to
another. Some of the most important old men were the center
of groups which even strangers approached respectfully to
hear the voices of well-known men.”
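Stopword features can be sketched as the rate at which each common function word occurs in a text. The ten-word `STOPWORDS` list below is a small illustrative assumption; a real system would likely use a much fuller list.

```python
import re
from collections import Counter

# Illustrative list only; not the list used in the talk.
STOPWORDS = ["the", "of", "and", "to", "he", "by", "his", "but", "so", "which"]

def stopword_features(text, stopwords=STOPWORDS):
    """Return each stopword's share of all tokens, in a fixed order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in stopwords]

feats = stopword_features("He went to the old men and to the young men")
```

Because only function words are counted, character names and settings never enter the feature vector.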
Vectorized Word Features
Methods:
● Skip-gram model allows us to incorporate word context in feature space
● Make features from word embeddings by averaging the vectors for each document
“By his age he should have belonged to the younger men, but
by his wealth and connections he belonged to the groups of
old and honored guests, and so he went from one group to
another. Some of the most important old men were the center
of groups which even strangers approached respectfully to
hear the voices of well-known men.”
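The averaging step can be sketched as below. The embeddings here are a toy dictionary of made-up 3-dimensional vectors standing in for trained skip-gram vectors; `document_vector` is a hypothetical helper, and skipping out-of-vocabulary words is an assumption.

```python
def document_vector(tokens, embeddings, dim):
    """Average the embedding vectors of the tokens that have one."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Toy vectors; real ones would come from a trained skip-gram model.
toy_embeddings = {
    "old": [1.0, 0.0, 2.0],
    "men": [3.0, 2.0, 0.0],
}
# "approached" has no vector in the toy dictionary and is skipped.
doc_vec = document_vector(["old", "men", "approached"], toy_embeddings, dim=3)
```

Averaging yields one fixed-length vector per document regardless of its length, which is what lets the embeddings sit alongside the other feature groups.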
Classification: Gaussian Naive Bayes
Inputs:
● Lexical Features
● Syntax Features
● Stopword Features
● Word Embeddings
Output (example):
P(Gogol) = 0.1
P(Dostoevsky) = 0.3
P(Tolstoy) = 0.6
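Gaussian Naive Bayes models each feature within each class as an independent normal distribution and returns one posterior probability per author. The following from-scratch sketch assumes the feature groups have been concatenated into a single numeric vector per document; the training values and the labels `author_a` and `author_b` are made up for illustration, and `var_floor` is an assumed safeguard against zero variance.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate per-class priors, feature means, and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_proba(model, x):
    """Return the posterior probability of each class for feature vector x."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        log_p = math.log(prior)
        for value, m, var in zip(x, means, variances):
            # Log of the normal density for this feature under this class.
            log_p += -0.5 * math.log(2 * math.pi * var) - (value - m) ** 2 / (2 * var)
        log_scores[label] = log_p
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: s / total for c, s in exp_scores.items()}

# Made-up two-feature training data (e.g. a ratio and a mean sentence length).
X = [[0.50, 12.0], [0.55, 14.0], [0.70, 25.0], [0.75, 27.0]]
y = ["author_a", "author_a", "author_b", "author_b"]
model = fit_gaussian_nb(X, y)
probs = predict_proba(model, [0.72, 26.0])
```

The predicted author is the label with the highest posterior, and the probabilities across all authors sum to 1, as in the example output above.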
Results:
- Up to 96% accuracy on binary class problems
- Up to 85% accuracy with as many as 6 authors