kristofor-nyquist
…but reddit is a popularity contest
Given that I’m interested in a post:
1. What are similar posts I can read?
2. Who are the best people I can talk to?
Goal: find related DIY projects
Collecting / organizing data
• Scraped reddit for do-it-yourself projects
– Collected text content from each project, i.e. title, externally linked blog post, comments, and general topic
Data conversion
• Combine all of the text to create one “document”
• Treat the document as a list of words
• Convert the list of words to a list of numbers
– Each number represents the “uniqueness” of a particular word to its document, i.e. 0 means the word appears in every post; 1 means the word appears only in that post
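This “uniqueness” score is the intuition behind the inverse-document-frequency part of TF-IDF. A minimal sketch, assuming a normalization chosen to match the slide’s 0-to-1 intuition (not necessarily the exact formula used in the project):

```python
import math

def uniqueness_scores(docs):
    """Score each word's 'uniqueness' across documents: 0 if the word
    appears in every post, 1 if it appears in only one post."""
    n = len(docs)
    vocab = {w for d in docs for w in d}
    df = {w: sum(w in d for d in docs) for w in vocab}  # document frequency
    # normalized IDF: log(n/df) / log(n) -> 0 when df == n, 1 when df == 1
    return {w: math.log(n / df[w]) / math.log(n) for w in vocab}

docs = [["weld", "steel", "project"],
        ["plant", "garden", "project"],
        ["wire", "solder", "project"]]
scores = uniqueness_scores(docs)
# "project" appears in every post -> 0.0; "weld" appears in one post -> 1.0
```

In the real pipeline each word’s score would also be weighted by how often it occurs within its document (the term-frequency part).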
Algorithm
Post similarity / Classification
• Turn the text into a list of numbers using term frequency-inverse document frequency (TF-IDF)
• “Compress” the data for speed
– from ~70,000 dimensions down to 80 dimensions
For similarity:
• Calculate cosine-similarity between documents
• Present user with 5 most similar posts
For classification:
• Logistic regression (L1 regularization)
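The similarity step above can be sketched as follows, assuming the 80-dimensional compressed vectors have already been computed (the vectors below are toy 3-dimensional stand-ins):

```python
import numpy as np

def top_similar(doc_vecs, query_idx, k=5):
    """Return indices of the k documents most similar to the query
    document by cosine similarity, excluding the query itself."""
    X = np.asarray(doc_vecs, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    sims = X @ X[query_idx] / (norms * norms[query_idx] + 1e-12)
    sims[query_idx] = -np.inf  # never recommend the query post itself
    return np.argsort(sims)[::-1][:k]

# toy "compressed" document vectors
vecs = [[1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0]]
top_similar(vecs, query_idx=0, k=2)  # document 1 is the closest match
```

For the classification branch, an L1-regularized logistic regression would be fit on the same compressed vectors; L1 regularization additionally drives uninformative feature weights to zero.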
Settling on 80 PCs
The choice of 80 principal components is somewhat arbitrary, BUT the overall accuracy of the classifier has definitely converged…
even though 80 PCs capture only ~30% of the variance.
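One way to check that variance claim is to compute the fraction of total variance captured by the top components from the singular values of the centered data matrix (a sketch; the project may have used a library PCA instead):

```python
import numpy as np

def explained_variance_ratio(X, n_components):
    """Fraction of total variance captured by the top n_components
    principal components, via singular values of the centered data."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2  # component variances are proportional to squared singular values
    return var[:n_components].sum() / var.sum()

# toy data: all variance lies along the first axis,
# so one component captures 100% of it
X = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
explained_variance_ratio(X, 1)
```

Plotting this ratio and classifier accuracy against the number of components is how one would confirm that accuracy plateaus near 80 PCs even while explained variance keeps growing.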
Validation numbers
• NLP algorithm
– accuracy: 0.62
– recalls: auto 0.73, electronic 0.60, home improvement 0.54, metalwork 0.44, other 0.52, outdoor 0.89, woodworking 0.68
• Random baseline
– accuracy: 0.17
– recalls: auto 0.05, electronic 0.00, home improvement 0.18, metalwork 0.00, other 0.08, outdoor 0.12, woodworking 0.32
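Recall here is the per-class hit rate: of all posts truly in a category, the fraction the classifier labeled correctly. A minimal sketch with made-up labels:

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall per class: correctly labeled examples of a class
    divided by the total true examples of that class."""
    hits, totals = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return {c: hits[c] / totals[c] for c in totals}

y_true = ["auto", "auto", "outdoor", "outdoor", "outdoor"]
y_pred = ["auto", "other", "outdoor", "outdoor", "auto"]
per_class_recall(y_true, y_pred)  # auto: 1/2, outdoor: 2/3
```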
Validating classifier
[Figure: column-normalized confusion matrix, prediction (columns) vs. truth (rows), over the categories auto, metalwork, home improvement, electronic, other, outdoor, and woodworking]
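A column-normalized confusion matrix like this one scales each prediction column to sum to 1, so an entry reads as “given this prediction, how often was each truth the case.” A sketch with two hypothetical categories:

```python
import numpy as np

def confusion_colnorm(y_true, y_pred, labels):
    """Confusion matrix with rows = truth and columns = prediction,
    each column normalized to sum to 1."""
    idx = {c: i for i, c in enumerate(labels)}
    m = np.zeros((len(labels), len(labels)))
    for truth, pred in zip(y_true, y_pred):
        m[idx[truth], idx[pred]] += 1
    col_sums = m.sum(axis=0, keepdims=True)
    return m / np.where(col_sums == 0, 1.0, col_sums)  # avoid 0/0

labels = ["auto", "outdoor"]
y_true = ["auto", "auto", "outdoor"]
y_pred = ["auto", "outdoor", "outdoor"]
confusion_colnorm(y_true, y_pred, labels)
```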
[Figure: one-vs-rest ROC curves for the woodworking, other, and home improvement categories]
With this application, we can tolerate false positives. We can also present users with a more limited set of options, maintaining higher accuracy.